Abstract: This work applies the theory of knowledge in distributed systems to the
design of efficient fault-tolerant protocols. We define a large class of problems requiring 
coordinated, simultaneous action in synchronous systems, and give a method of
transforming specifications of such problems into protocols that are optimal in all runs: for
every possible input to the system and faulty processor behavior, these protocols are 
guaranteed to perform the simultaneous actions as soon as any other protocol could 
possibly perform them. This transformation is performed in two steps. In the first step, 
we extract directly from the problem specification a high-level protocol programmed us- 
ing explicit tests for common knowledge. In the second step, we carefully analyze when 
facts become common knowledge, thereby providing a method of efficiently implement- 
ing these protocols in many variants of the omissions failure model. In the generalized 
omissions model, however, our analysis shows that testing for common knowledge is 
NP-hard. Given the close correspondence between common knowledge and simultane- 
ous actions, we are able to show that no optimal protocol for any such problem can 
be computationally efficient in this model. The analysis in this paper exposes many 
subtle differences between the failure models, including the precise point at which this 
gap in complexity occurs. 
1 Introduction


The problem of ensuring proper coordination between processors in distributed systems 
whose components are unreliable is both important and difficult. There are generally 
two aspects to such coordination: the actions the different processors perform, and the 
relative timing of these actions. Both aspects are crucial, for instance, in maintaining 
consistent views of a distributed database. In particular, it is often most desirable to 
perform coordinated actions simultaneously at different sites of a system. It is therefore
of great interest to study the design of protocols involving simultaneous actions, actions 
performed simultaneously by all processors whenever they are performed at all. 


This paper presents a novel approach to the design of fault-tolerant protocols. We 
begin by defining a general class of simultaneous choice problems, a class intended 
to capture the essence of simultaneous coordination in synchronous systems. Many 
well-known problems, such as simultaneous Byzantine agreement, distributed firing 
squad, etc., can be formulated as such problems. We then study the design of efficient 
protocols for such problems in a number of variants of the omissions failure model
(cf. [MSF]). Given any satisfiable specification of a simultaneous choice problem, we
derive a protocol for the problem with the unique property of being optimal in all runs:
For every possible input to the system and faulty processor behavior, this protocol is 
guaranteed to perform the simultaneous actions as soon as they would be performed by 
any other protocol for the problem. (We will use optimal as shorthand for optimal in all 
runs.) In contrast, previous protocols for such problems do not adapt their behavior on 
the basis of faulty processor behavior, and hence always perform as poorly as they do in 
their worst case run. A general method of obtaining optimal protocols for simultaneous 
problems in the simpler crash failure model is implicit in the work of Dwork and Moses 
(cf. [DM]), which provided the original motivation for this work. 


Our approach is based on the close relationship between knowledge, communication,
and action in distributed systems: A number of recent works (cf. [HM], [DM],
[Mo]) show that simultaneous actions are closely related to common knowledge. Informally,
a fact is common knowledge if it is true, everyone knows it, everyone knows that
everyone knows it, and so on ad infinitum. Notice that every processor performing 
a simultaneous action knows the action is being performed. In addition, since such 
actions are performed simultaneously by all processors, every processor knows that 
all processors know the action is being performed. This argument can be formalized 
and extended to show that when a simultaneous action is performed, it is common 
knowledge that the action is being performed. Consequently, a necessary condition for 
performing simultaneous actions is attaining common knowledge of particular facts. 
Interestingly, our work shows that in a precise sense this is also a sufficient condition: 
The problem of performing simultaneous actions reduces to the problem of attaining 
common knowledge of particular facts. 


In deriving optimal protocols for simultaneous choice problems, we make explicit 
and direct use of the relationship between common knowledge and simultaneous actions. 
The derivation proceeds in two stages. In the first stage, we program the optimal 
protocols in a high-level language where processors’ actions depend on explicit tests for 
common knowledge of certain facts. These high-level protocols are extracted directly 
from the problem specifications via a few simple manipulations. The second stage 
deals with effectively implementing these tests for common knowledge. We give a 
direct implementation of such tests in all variants of the omissions failure model we 
consider. As a result, our high-level protocols have effective implementations in these 
failure models as low-level, standard protocols that are optimal in all runs. 


Consider, for example, the following version of the distributed firing squad problem 
(cf. [BL], [CDDS], [R]): An external source may send “start” signals to some of the 
processors in the system at unpredictable, possibly different, times. It is required that 
(i) if any nonfaulty processor receives a “start” signal, then all nonfaulty processors 
perform an irreversible “firing” action at some later point, (ii) whenever any nonfaulty 
processor “fires,” all nonfaulty processors do so simultaneously, and (iii) if no processor 
receives a “start” signal, then no nonfaulty processor “fires.” The high-level protocol 
we derive for this problem in the omissions model requires all processors to act as 
follows: 


    repeat every round
        send current view to every processor
    until it is common knowledge that
        some processor received a “start” signal;
    “fire” and halt.


Since we exhibit an effective implementation of the test for common knowledge em- 
bedded in this protocol, this high-level protocol can be transformed into a standard 
protocol that is optimal in all runs. No previous protocol for this problem suggested in 
the literature is optimal in all runs. Furthermore, in many cases this protocol “fires” 
much earlier than any other known protocol for this problem: In some cases, this 
protocol “fires” as soon as one round after the first “start” signal is received. 
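
As a rough illustration of the control structure of this high-level protocol, the following Python sketch runs the loop for a single processor. The names `exchange_views` and `start_is_common_knowledge` are hypothetical hooks introduced only for this sketch: the first stands for sending the current view to every processor, and the second stands for the test for common knowledge, whose implementation is not shown here. The bound `max_rounds` is likewise an assumption that merely keeps the simulation finite.

```python
from typing import Callable, Optional


def high_level_firing_squad(
    exchange_views: Callable[[int], None],
    start_is_common_knowledge: Callable[[int], bool],
    max_rounds: int = 1000,
) -> Optional[int]:
    """Return the round at whose end the processor fires, or None if the
    test never succeeds within max_rounds (an artificial simulation bound)."""
    for k in range(1, max_rounds + 1):
        exchange_views(k)                  # "send current view to every processor"
        if start_is_common_knowledge(k):   # explicit test for common knowledge
            return k                       # "fire" and halt
    return None


# Toy usage: pretend the fact becomes common knowledge at the end of round 3.
rounds_run = []
fired_at = high_level_firing_squad(rounds_run.append, lambda k: k >= 3)
```

The sketch deliberately leaves the test as a parameter: all of the difficulty of the problem is concentrated in implementing that test for a given failure model.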


We show that optimal protocols for simultaneous choice problems can always be 
implemented in a communication efficient way, in all variants of the omissions model 
we consider. However, our direct implementation of tests for common knowledge is 
not computationally efficient: It requires processors to perform exponential time com- 
putations between consecutive rounds of communication. One of the major technical 
contributions of this paper is a method of efficiently implementing tests for common 
knowledge in several variants of the omissions failure model. In the standard omissions 
model, we provide a clean and concise method of efficiently implementing tests for common
knowledge. The analysis underlying this method reveals the basic combinatorial
structure underlying the omissions model, and crisply characterizes the set of
facts that can be common knowledge at any point in the execution of a protocol. In
the receiving omissions model, in which faulty processors may fail to receive messages
rather than to send messages, testing for common knowledge is shown to be trivial. 
This exposes a significant difference between two seemingly symmetric failure models. 


We are not able to efficiently implement tests for common knowledge in the generalized
omissions model, in which faulty processors may fail both to send and to
receive messages. In fact, we show that testing for common knowledge in this model
is NP-hard. As a result, using the close relationship between common knowledge and
simultaneous actions, we are able to show that no optimal protocol for any reasonable 
simultaneous choice problem can be computationally efficient unless P=NP. In partic- 
ular, in this model there can be no computationally-efficient optimal protocol for the 
distributed firing squad problem stated above, for simultaneously performing Byzantine
agreement (cf. [PSL], [DM]), or for almost any other simultaneous problem. We
consider another variant of the omissions model, called generalized omissions with
information, in which it is assumed that the intended receiver of an undelivered message
can test (and therefore knows) whether it or the sender is at fault. We show that the 
techniques used in the standard omissions model extend to this model as well, yielding 
computationally-efficient optimal protocols. As a result, we see that optimal protocols 
for simultaneous choice problems are computationally intractable in the generalized 
omissions model precisely because of the fact that in this model undelivered messages 
do not uniquely determine the set of faulty processors. 


Thus, we show how to derive efficient optimal protocols in the omissions model, 
and we show that optimal protocols are intractable in the generalized omissions model. 
Since it is unrealistic to expect conventional processors (limited to polynomial-time 
computation) to follow such intractable protocols, it becomes interesting to
ask how well resource-bounded processors can perform simultaneous actions in the 
generalized omissions model. Analyzing this problem will require extending the theory 
of knowledge in distributed systems to account for the restricted computational power 
of such processors. Such an extension should give rise to notions of resource-bounded 
knowledge and common knowledge that closely correspond to the ability of resource- 
bounded processors to perform simultaneous actions. The need for a theory of resource- 
bounded knowledge has already been demonstrated, primarily by problems in which 
computational complexity is introduced by restricting the computational power of the 
adversary, thus allowing solutions involving encryption. This work, however, provides 
a more compelling indication of the need for such a theory, even for the analysis of 
simple problems in distributed computation that do not make such assumptions about 
the adversary. 


Since the role of knowledge in the design and analysis of protocols can be understood 
without delving into the details of this work, we suggest that a first reading of this
paper be based solely on the text and statements of results. The paper is organized
as follows: Section 2 defines the model of distributed systems used in the paper, and 
Section 3 contains precise definitions of notions of knowledge in such a system. In 
Section 4 we define the notion of a simultaneous choice problem, a large class of problems
involving coordinated simultaneous actions. Section 5 presents a uniform method of 
deriving an optimal high-level protocol from the specification of a simultaneous choice 
problem, using explicit tests for common knowledge. Section 6 deals with the problem 
of efficiently implementing tests for common knowledge of facts relevant to simultaneous 
choice problems. The analysis in Section 6 reveals interesting properties of the different 
failure models, and exposes fine distinctions between them. Finally, Section 7 contains 
some concluding remarks. 


2 Model of a System 


This section introduces a model of the distributed systems with which this paper is 
concerned. Our treatment extends and is closely related to that of [DM]. 


We consider synchronous systems of unreliable processors. Such a system consists 
of a finite collection $P = \{p_1, \ldots, p_n\}$ of n processors (n > 2), each pair of which is
connected by a two-way communication link. Processors share a discrete global clock¹
that starts at time 0 and advances in increments of one. Communication in the system
proceeds in a sequence of rounds, with round k taking place between time k − 1 and
time k. Between rounds of communication a processor may perform local computation 
and other internal actions. A processor starts in some initial state at time 0. Then, in 
every following round, the processor first sends a set of messages to other processors, 
and then receives messages sent to it by other processors during the same round. In 
addition, a processor may also receive requests for service from clients external to 
the system (think, for example, of a distributed airline reservation system). Actions 
resulting from the servicing of such requests may take a variety of forms, including the 
initiation of various activities within the system by sending certain messages to other 
processors in later rounds. Each message is assumed to be tagged with the identities of 
the sender and intended receiver of the message, as well as the round in which it is sent; 
similarly for each request. At any given time, a processor’s message history consists of 
the list of messages it has received from the other processors, and a processor’s input 
history consists of its initial state together with the requests it has received from the 
system’s external clients. A processor’s view at any given time consists of its message 
history, its input history, the time on the global clock, and the processor’s identity. For
technical reasons, it will be convenient to talk about processors’ views at negative times
(before time 0). A processor’s view at a negative time is defined to be a distinguished
empty view.


¹We assume the existence of a shared global clock for ease of exposition. The analysis performed in
this paper applies even in synchronous systems in which processors have local clocks and start operating
in an arbitrarily staggered order.


We think of the processors as following a protocol, which specifies exactly what 
messages each processor is required to send during a round (and what other actions 
the processor is required to take) as a deterministic function of the processor’s view. 
Notice that processors must compute this function by following some algorithm. Thus, 
while we formally define a protocol to be a function, it is convenient to maintain both 
views of a protocol as a function and an algorithm. While a protocol determines the 
behavior of each processor (as a function of its view), processors are unreliable and 
some of them may be faulty, the rest being nonfaulty. Both faulty and nonfaulty 
processors faithfully follow the protocol, their behaviors differing only in the messages 
they successfully send and receive.² A nonfaulty processor sends every message it
is required by the protocol to send, and receives every message sent to it by other
processors, in all rounds of communication. A faulty processor, however, may fail to
send or receive certain messages; a processor is said to fail during a given round if
it fails to send or receive a message during that round. We will consider a number of
different processor failure models: (i) the omissions model (cf. [MSF]), in which a faulty
processor receives every message sent to it, but sends only an arbitrary (not necessarily 
strict) subset of the messages it is required to send; (ii) the receiving omissions model, 
in which a faulty processor sends every message it is required to send, but receives only 
an arbitrary subset of the messages sent to it; (iii) the generalized omissions model, 
in which a faulty processor may both send only an arbitrary subset of the messages 
it is required to send and receive only an arbitrary subset of the messages sent to 
it; and (iv) generalized omissions with information, which differs from the generalized 
omissions model in that a processor not receiving a message from another processor 
can determine whether it or the sender is at fault. 


An infinite execution of a protocol in a system is called a run of the protocol. We 
identify a run with the complete history of events that take place during the run, from 
time 0 until the end of time. This includes each processor’s complete input history,
complete message history, and, if the processor is faulty in the run, a description of
its behavior during each round (formalized in the following paragraph). A pair $(\rho, \ell)$,
where $\rho$ is a run and $\ell$ is a natural number, is called a point, and represents the state
of the system after the first $\ell$ rounds of $\rho$. We denote processor q’s view at the point
$(\rho, \ell)$ by $v(q, \rho, \ell)$.


²Intuitively, processors attempt to send and receive all required messages. Failures are caused by
faulty input/output ports. However, we will often speak of processors failing to send or receive a given
message when we mean that the message was not successfully sent or received, respectively.


We now define the notion of a failure pattern, a formal description of faulty processor
behavior during a run. The notion of a failure pattern in each variant of the
omissions model is a suitable restriction of this general definition. Remember that a
faulty processor may fail to send or receive certain messages. It is therefore natural to 
define the faulty behavior of a processor p to be a pair of functions S and R mapping 
round numbers to sets of processors. Intuitively, these are the processors to which p 
fails to send and receive messages, respectively, during each round. The processor p is 
said to display this faulty behavior during a given run if in every round k processor p 
sends no messages to processors in S(k) but sends all required messages to processors 
not in S(k), and receives no messages from processors in R(k) but receives all messages 
sent to it by processors not in R(k). The failure pattern of a run is a set of pairs
$(p_i, (S_i, R_i))$ consisting of a processor and a faulty behavior, such that the processors
appearing in the failure pattern are exactly those that are faulty in the run, and each
displays the corresponding faulty behavior. Given a run $\rho$, if $\iota_i$ is the complete input
history of processor $p_i$ in $\rho$, then we say that $\iota = (\iota_1, \ldots, \iota_n)$ is the (external) input
to $\rho$. A pair $(\pi, \iota)$, where $\pi$ is a failure pattern and $\iota$ is an input, is called an operating
environment. Notice that a run is uniquely determined by a protocol and an operating 
environment. Two runs of two different protocols are said to be corresponding runs if 
they have the same operating environment. The fact that an operating environment is 
independent of the protocol will allow us to compare different protocols according to 
their behavior in corresponding runs. 
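
The definition of a faulty behavior translates almost directly into a delivery rule. The Python sketch below is a minimal illustration under our own encoding, not part of the formal model: a faulty behavior is a pair of dictionaries mapping a round number to a set of processor identifiers (a missing round means the empty set), and the names `delivered` and `fits_model` are hypothetical helpers introduced here.

```python
from typing import Dict, Set, Tuple

# A faulty behavior is a pair (S, R); each maps a round number to a set of processors.
Behavior = Tuple[Dict[int, Set[int]], Dict[int, Set[int]]]
# A failure pattern maps each faulty processor to its faulty behavior.
FailurePattern = Dict[int, Behavior]


def delivered(pattern: FailurePattern, sender: int, receiver: int, k: int) -> bool:
    """True iff a message the protocol requires sender to send to receiver
    in round k is actually received."""
    send_omits = pattern.get(sender, ({}, {}))[0].get(k, set())
    recv_omits = pattern.get(receiver, ({}, {}))[1].get(k, set())
    return receiver not in send_omits and sender not in recv_omits


def fits_model(pattern: FailurePattern, model: str) -> bool:
    """Check the restriction each failure model places on a failure pattern."""
    for S, R in pattern.values():
        if model == "omissions" and any(R.values()):
            return False          # faulty processors never fail to receive
        if model == "receiving omissions" and any(S.values()):
            return False          # faulty processors never fail to send
    return True                   # generalized omissions: no restriction


# Processor 1 fails to send to processor 0 in round 2, and is otherwise correct.
pattern: FailurePattern = {1: ({2: {0}}, {})}
```

Generalized omissions with information admits the same failure patterns as generalized omissions; the two differ in what a receiver can determine about an undelivered message, which this encoding does not capture.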


In this work, we study the behavior of protocols in the presence of a bounded 
number of failures (of a particular type) and a given setting of possible inputs. It 
is therefore natural to identify a system with the set of all possible runs of a given 
protocol under such circumstances. Formally, a system is identified with the set of 
runs of a protocol P by n > 2 processors, at most t < n − 2 of which may be faulty
(in the sense of a particular failure model M), where the complete input history of
each processor $p_i$ is an element of a set $\Gamma_i$. We denote this set of runs by the tuple
$\Sigma = (n, t, P, M, \Gamma_1, \ldots, \Gamma_n)$. Our definition of a system ensures that the input to the
system is orthogonal to, and hence carries no information about, the failure pattern. In
addition, since the set of possible inputs in the system has the form $\Gamma_1 \times \cdots \times \Gamma_n$, one
processor’s input contains no information about any other processor’s input, and hence
the only way in which processors obtain information about other processors’ inputs is
via messages communicated between the processors in the system. 


While a protocol may be thought of as a function of processors’ views, protocols 
for distributed systems (as well as protocols for sequential and parallel computation) 
are typically written for systems of arbitrarily large size. In this sense, the actions 
and messages required of a processor by a protocol actually depend on the number 
of processors in the system (and perhaps the bound on the number of failures) as 
well as the view of the processor. Therefore, we formally define a protocol to be a 
function from n, t, and a processor’s view to a list of actions the processor is required 
to perform, followed by a list of messages the processor is required to send in the
following round. Since each protocol is defined for systems of arbitrary size, it is natural
to define a class of systems to be a collection of systems $\{\Sigma(n,t) : n > t+2 > 0\}$, where
$\Sigma(n,t) = (n, t, P, M, \Gamma_1, \ldots, \Gamma_n)$ for some fixed protocol P, failure model M, and input
sets $\Gamma_i$.


3 Definition of Knowledge 


Our analysis makes essential use of reasoning about processors’ knowledge at various 
points in the execution of a protocol. This section contains precise definitions of the 
notions of knowledge we use. For the purpose of these definitions, we assume that a 
particular system, a set of runs as defined in the previous section, is fixed ahead of 
time. All runs mentioned will be runs of this system, and all points will be points in 
such runs. Our treatment is a modification of that of [DM] and [HM]. 


We assume the existence of an underlying logical language for representing all relevant
ground facts: facts about the system that do not explicitly mention processors’
knowledge (for example, “the value of register x is 0”, or “processor $p_i$ failed in
round 3”). Formally, a ground fact $\varphi$ will be identified with a set of points $\tau(\varphi)$. A
ground fact is said to hold at a point $(\rho, \ell)$, denoted $(\rho, \ell) \models \varphi$, iff $(\rho, \ell) \in \tau(\varphi)$.
We will define various ground facts as we go along. The set of points corresponding to
these facts will be clear from context. A fact is said to be valid if it is true of all points
in all systems. A fact is said to be valid in the system for a given system if it is true of
all points in the system.


We now define what facts a processor is said to “know” at any given point $(\rho, \ell)$ in
the system. Roughly speaking, a processor $p_i$ is said to know a fact $\varphi$ if $\varphi$ is guaranteed
to hold, given $p_i$’s view of the run. More formally, we say $p_i$ knows $\varphi$ at $(\rho, k)$, denoted
$(\rho, k) \models K_i\varphi$, if $(\rho', k) \models \varphi$ for all points $(\rho', k)$ satisfying $v(p_i, \rho, k) = v(p_i, \rho', k)$.
This definition of knowledge is essentially the total view interpretation of [HM]. It is
“external,” in the sense that a processor is ascribed knowledge based solely on the
processor’s information, and not, say, on the local computation it performs or on its
computational power. Notice that a processor’s knowledge at a given point depends on
the system as well as on the specific run. Thus, implicit in the definition of $(\rho, \ell) \models \varphi$
is the system relative to which knowledge is determined. Throughout the paper it will
be clear from context what the relevant system should be whenever “$\models$” is used.
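
For a system with finitely many points, this definition can be evaluated by brute force. The following Python sketch is only an illustration under assumed encodings: a point is a pair (run name, time), `view` is a hypothetical function returning a processor's view at a point, and a fact is a predicate on points. None of these names come from the formal development.

```python
from typing import Callable, Hashable, Iterable, Tuple

Point = Tuple[str, int]  # (run name, time): a hypothetical finite encoding


def knows(points: Iterable[Point],
          view: Callable[[int, Point], Hashable],
          i: int,
          fact: Callable[[Point], bool],
          point: Point) -> bool:
    """K_i(fact) at `point`: fact holds at every point of the same time at
    which processor i has the same view as it has at `point`."""
    _, k = point
    return all(fact(q) for q in points
               if q[1] == k and view(i, q) == view(i, point))


# Toy system with two runs observed at time 1. Processor 0's view records
# which run it is in; processor 1's view does not distinguish the runs.
points = [("r0", 1), ("r1", 1)]
view = lambda i, pt: (i, pt[1], pt[0] if i == 0 else None)
in_r0 = lambda pt: pt[0] == "r0"
```

In this toy system processor 0 knows the fact `in_r0` at ("r0", 1), while processor 1 does not, because processor 1 has the same view at ("r1", 1), where the fact fails.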


We will find it useful to extend this definition of knowledge to sets of processors as
well. The view of a set of processors $G \subseteq P$ at $(\rho, k)$, denoted $v(G, \rho, k)$, is defined by:

\[ v(G, \rho, k) \stackrel{\mathrm{def}}{=} \{ v(p, \rho, k) : p \in G \}. \]

Thus, roughly speaking, G’s view is simply the joint view of its members. We say
that the group G implicitly knows $\varphi$ at $(\rho, k)$, denoted $(\rho, k) \models I_G\varphi$, if for all points
$(\rho', k)$ satisfying $v(G, \rho, k) = v(G, \rho', k)$ it is the case that $(\rho', k) \models \varphi$. In the particular
case that G is the singleton set $\{p_i\}$, the notions of $I_G$ and $K_i$ coincide. Intuitively, G
implicitly knows $\varphi$ if the joint view of G’s members guarantees that $\varphi$ holds. Notice
that if processor p knows $\varphi$ and processor q knows $\varphi \supset \psi$, then together they implicitly
know $\psi$, even if neither of them knows $\psi$ individually. The notion of implicit knowledge
was first defined in [HM].
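
The remark about two processors pooling their knowledge can be checked on a three-point toy system. The Python sketch below is an illustration under assumed encodings (points as (name, time) pairs, a lookup table of views, facts as predicates); the names `group_view` and `implicitly_knows` are hypothetical helpers, and processors p and q are represented by the identifiers 0 and 1.

```python
def group_view(view, group, point):
    """Joint view of a group: the set of its members' views, tagged by identity."""
    return frozenset((i, view(i, point)) for i in group)


def implicitly_knows(points, view, group, fact, point):
    """I_G(fact) at `point`: fact holds wherever the group's joint view is the same."""
    _, k = point
    here = group_view(view, group, point)
    return all(fact(q) for q in points
               if q[1] == k and group_view(view, group, q) == here)


# Three points at time 0. Processor 0 cannot tell w1 from w2;
# processor 1 cannot tell w1 from w3.
points = [("w1", 0), ("w2", 0), ("w3", 0)]
VIEWS = {0: {"w1": "a", "w2": "a", "w3": "b"},
         1: {"w1": "x", "w2": "y", "w3": "x"}}
view = lambda i, pt: VIEWS[i][pt[0]]

phi = lambda pt: pt[0] in ("w1", "w2")           # holds at w1 and w2
psi = lambda pt: pt[0] == "w1"                   # holds at w1 only
phi_implies_psi = lambda pt: (not phi(pt)) or psi(pt)
here = ("w1", 0)
```

At the point `here`, processor 0 knows `phi`, processor 1 knows `phi_implies_psi`, neither knows `psi` alone, and the group {0, 1} implicitly knows `psi` because its joint view singles out w1.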


The notions of knowledge and implicit knowledge defined above are closely related 
to modal logics that have been extensively studied by philosophers (cf. [Hi]). We say 
that an operator M has the properties of the modal system S5 if it satisfies (a) if $\varphi$ is
valid in the system then $M\varphi$ is valid in the system; and the following formulas are valid:
(b) $M\varphi \supset \varphi$; (c) $(M\varphi \land M(\varphi \supset \psi)) \supset M\psi$; (d) $M\varphi \supset MM\varphi$; and (e) $\neg M\varphi \supset M\neg M\varphi$.
The definitions of knowledge and implicit knowledge given above immediately imply
the following (cf. [HM2], [DM]):


Proposition 1: The operators $K_i$ and $I_G$ each have the properties of S5.


Finally, the state of common knowledge among a group of processors will be central 
to our analysis. Its central role will result from the close correspondence between
common knowledge among the members of a group and simultaneous actions performed
by the group. Roughly speaking, as we mentioned in the introduction, a fact $\varphi$ is
common knowledge to a given group if $\varphi$ holds, everyone in the group knows $\varphi$, everyone
knows that everyone knows $\varphi$, and so on ad infinitum. Formally defining common
knowledge, however, must be done with great care. The problem is that the groups
of interest are not always explicitly given as fixed subsets of P. For example, we will
be most interested in facts that are common knowledge to the group $\mathcal{N}$ of nonfaulty
processors. In any given context (in this case, any given run), this group is a fixed
set of processors. But the precise identity of $\mathcal{N}$ varies from one context to another.
This motivates us to define common knowledge for a slightly more general notion of
groups of processors: An indexical set $S$ of processors is a function mapping points
to sets of processors. That is, $S : (\rho, \ell) \mapsto S(\rho, \ell)$, where $S(\rho, \ell) \subseteq P$. The notion of
an indexical set is a direct generalization of the notion of a fixed set of processors.
In particular, we can identify a fixed set of processors with a constant indexical set.
The group $\mathcal{N}$ of nonfaulty processors, the group P of all processors, the group of all
processors that haven’t displayed faulty behavior by the current time, and many other 
groups of interest are all indexical sets of processors. 


The first step in defining common knowledge to a given group of processors is to
determine what it means for everyone in the group to know a fact. For a fixed set G,
“everyone in G knows $\varphi$,” denoted $E_G\varphi$, is customarily defined by

\[ E_G\varphi = \bigwedge_{p_i \in G} K_i\varphi \]

(cf. [HM]). In extending this notion to indexical sets, however, a subtle decision must
be made. The immediate generalization of this definition is to define

\[ E_S\varphi = \bigwedge_{p_i \in S} K_i\varphi. \]

This generalization, however, does not yield a notion of common knowledge that nicely
corresponds to S’s ability to perform simultaneous actions (see Lemma 4 below). Given
that G is a fixed set, and that the knowledge operator $K_i$ satisfies property (a) of S5
given above, it follows that $p_i \in G \supset K_i(p_i \in G)$ is valid. Therefore, an equivalent
definition of $E_G\varphi$ is

\[ E_G\varphi = \bigwedge_{p_i \in G} K_i(p_i \in G \supset \varphi). \]

We choose this form of “everyone knows” as the appropriate one to extend to indexical
sets. Formally, given an indexical set $S$, we define $E_S\varphi$, essentially corresponding to
everyone in S knows $\varphi$, by:

\[ E_S\varphi \stackrel{\mathrm{def}}{=} \bigwedge_{p_i \in S} K_i(p_i \in S \supset \varphi). \]

Roughly speaking, $E_S\varphi$ holds exactly if every member of S knows that, if it is a member
of S, then $\varphi$ holds.
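
The difference between the two candidate definitions shows up as soon as membership in the group varies between points a processor cannot distinguish. The Python sketch below illustrates this on a two-run toy system; the encodings (points as (run, time) pairs, `view` as a function, an indexical set as a function from points to sets of identifiers) and the helper names are assumptions made for this sketch only.

```python
def knows(points, view, i, fact, point):
    """K_i(fact) at `point`, by brute force over a finite set of points."""
    return all(fact(q) for q in points
               if q[1] == point[1] and view(i, q) == view(i, point))


def everyone_knows(points, view, S, fact, point):
    """E_S(fact) at `point`: every member i of S(point) knows that
    if i is a member of S then fact holds."""
    return all(
        knows(points, view, i, lambda q, i=i: i not in S(q) or fact(q), point)
        for i in S(point)
    )


points = [("r0", 1), ("r1", 1)]
# Indexical set of nonfaulty processors: processor 1 is faulty in run r1 only.
nonfaulty = lambda pt: {0, 1} if pt[0] == "r0" else {0}
# Processor 0 can tell the runs apart; processor 1 cannot.
view = lambda i, pt: pt[0] if i == 0 else "same"
fact = lambda pt: pt[0] == "r0"
```

At ("r0", 1) processor 1 does not know `fact`, since it has the same view in run r1 where `fact` fails. It does know that `fact` holds if it is nonfaulty, because it is faulty in r1, so `everyone_knows` holds for the nonfaulty processors at that point.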


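We now define $\varphi$ is common knowledge to S, denoted $C_S\varphi$, by:

\[ C_S\varphi \stackrel{\mathrm{def}}{=} \varphi \land E_S\varphi \land E_S E_S\varphi \land \cdots \land E_S^m\varphi \land \cdots. \]

Here $E_S^m\varphi$ abbreviates m nested applications of $E_S$ to $\varphi$.
In other words, $(\rho, \ell) \models C_S\varphi$ iff both $(\rho, \ell) \models \varphi$ and for all $m \ge 1$ it is the case
that $(\rho, \ell) \models E_S^m\varphi$. Thus, roughly speaking, a fact is common knowledge if it is true,
everyone knows it, everyone knows that everyone knows it, etc. The definitions of $E_S$
and $C_S$ directly generalize the standard notions from [HM] and [DM].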


A useful tool for thinking about $E_S^m\varphi$ and $C_S\varphi$ is an undirected graph whose nodes
are the points of the system, in which two points $(\rho, k)$ and $(\rho', k)$ are connected by an
edge iff some processor p that is a member of both $S(\rho, k)$ and $S(\rho', k)$ has the same
view at both $(\rho, k)$ and $(\rho', k)$. This graph is called the similarity graph relative to S.
For example, if S is the set $\mathcal{N}$ of nonfaulty processors, two points are connected by an
edge in the similarity graph iff there is a processor that is nonfaulty at both points,
and has the same view at both points. An easy argument by induction on m shows
that $(\rho, k) \models E_S^m\varphi$ iff $(\rho', k) \models \varphi$ for all points $(\rho', k)$ of distance at most m from
$(\rho, k)$ in this graph. It follows that $(\rho, k) \models C_S\varphi$ iff $(\rho', k) \models \varphi$ for all points $(\rho', k)$ in
the connected component of $(\rho, k)$. Two points $(\rho, \ell)$ and $(\rho', \ell)$ are said to be similar
relative to S, denoted $(\rho, \ell) \stackrel{S}{\sim} (\rho', \ell)$, if they are in the same connected component of
the similarity graph. Since the indexical set S is generally clear from context (most
often being the set $\mathcal{N}$ of nonfaulty processors), we denote similarity by $\sim$ without the
superscript S. We thus have:


Theorem 2: $(\rho, k) \models C_S\varphi$ iff $(\rho', k) \models \varphi$ for all points $(\rho', k)$ satisfying $(\rho, k) \sim (\rho', k)$.


Our analysis will exploit this relationship between common knowledge and the similarity
graph. The similarity graph will provide us with a useful combinatorial tool with
which to study when facts become common knowledge.
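
On a finite set of points, Theorem 2 suggests a direct test for common knowledge: search the connected component of the current point in the similarity graph and check the fact at every point reached. The Python sketch below does this by breadth-first search. It is a brute-force illustration under assumed encodings (points as (name, time) pairs, `view` as a function, the indexical set S as a function returning Python sets), not the efficient implementation developed later in the paper.

```python
from collections import deque


def common_knowledge(points, view, S, fact, point):
    """C_S(fact) at `point`: fact holds at every point in the connected
    component of `point` in the similarity graph relative to S."""

    def adjacent(a, b):
        # Edge iff some processor in both S(a) and S(b) has the same view at both.
        return a[1] == b[1] and any(view(i, a) == view(i, b) for i in S(a) & S(b))

    seen, queue = {point}, deque([point])
    while queue:
        a = queue.popleft()
        if not fact(a):
            return False
        for b in points:
            if b not in seen and adjacent(a, b):
                seen.add(b)
                queue.append(b)
    return True


# A chain a - b - c: processor 0 links a and b, processor 1 links b and c.
points = [("a", 1), ("b", 1), ("c", 1)]
VIEWS = {0: {"a": "u", "b": "u", "c": "v"},
         1: {"a": "x", "b": "y", "c": "y"}}
view = lambda i, pt: VIEWS[i][pt[0]]
S = lambda pt: {0, 1}
not_c = lambda pt: pt[0] != "c"
```

In this example the fact `not_c` holds at every point within distance one of ("a", 1), yet it is not common knowledge there, because the point ("c", 1) is reachable in two steps.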


One of the useful properties of common knowledge is the so-called “fixpoint axiom”
(cf. [HM])

\[ C_S\varphi \equiv E_S C_S\varphi, \]

which states that common knowledge is a fixpoint of the $E_S$ operator (provided S is
nonempty, as will invariably be the case in this work). It implies that a fact’s being
common knowledge is in a sense “public:” a fact can be common knowledge to a group 
of processors only if all members of the group know that it is common knowledge. This 
axiom also implies that when a fact becomes common knowledge, it becomes common 
knowledge to all relevant processors simultaneously. Another useful fact about common 
knowledge is captured by the following induction rule: 


If $\varphi \supset E_S\varphi$ is valid in the system,
then $\varphi \supset C_S\varphi$ is valid in the system.
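
The induction rule can be justified by a short induction on m. The following LaTeX sketch is our own reconstruction of the standard argument, not a proof quoted from the cited works. It writes $E_S^m$ for m nested applications of $E_S$ and uses only properties (a) and (c) of S5 for the operators $K_i$.

```latex
\begin{align*}
&\text{Assume } \varphi \supset E_S\varphi \text{ is valid in the system.}\\
&\text{Claim: } \varphi \supset E_S^m\varphi \text{ is valid in the system for every } m \ge 1.\\
&\text{Base } (m = 1)\text{: this is the assumption.}\\
&\text{Step: if } \varphi \supset E_S^m\varphi \text{ is valid, then so is }
  (p_i \in S \supset \varphi) \supset (p_i \in S \supset E_S^m\varphi);\\
&\quad \text{by (a) and (c) for } K_i,\quad
  K_i(p_i \in S \supset \varphi) \supset K_i(p_i \in S \supset E_S^m\varphi)
  \text{ is valid for each } i;\\
&\quad \text{taking the conjunction over } p_i \in S,\quad
  E_S\varphi \supset E_S^{m+1}\varphi \text{ is valid};\\
&\quad \text{combined with the assumption,}\quad
  \varphi \supset E_S^{m+1}\varphi \text{ is valid.}\\
&\text{Since } C_S\varphi = \varphi \land E_S\varphi \land E_S^2\varphi \land \cdots,
  \text{ it follows that } \varphi \supset C_S\varphi \text{ is valid in the system.}
\end{align*}
```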


Roughly speaking, the induction rule implies that if a fact is “public” to a group 
of processors, in the sense that whenever it holds it is known to all members of the 
group, then whenever it holds it is in fact common knowledge. These are two essential 
properties of common knowledge that will prove useful to our analysis. In addition, we 
can also show the following: 


Proposition 3: The operator $C_S$ has the properties of S5.


According to our definitions, facts about the system are properties of points: they 
are either true or false at any given point. It is often useful to be able to refer to facts 
as being about things other than points (e.g., properties of runs). In general, a fact $\varphi$
is said to be a fact about X if fixing X determines the truth (or falsity) of $\varphi$. For
example, a fact $\varphi$ is said to be a fact about the input if fixing the input determines
whether or not $\varphi$ holds. That is, for any given input $\iota$, either $\varphi$ holds at all points
$(\rho, k)$ where $\rho$ is a run with input $\iota$, or $\varphi$ holds at no such point. The meaning of a fact
being about the operating environment, about the existence of failures, about the first k 
rounds, etc., are similarly defined. 


4 Simultaneous Choice Problems 


In order to study in a uniform and general way the design of protocols for problems 
involving coordinated simultaneous action, a definition of this class of problems is 
required. Lacking a most general definition, we focus on the class of simultaneous 
choice problems, a large class of problems that capture the essence of such coordinated 
action in a distributed environment. Roughly speaking, these problems require that 
one of a number of alternative actions be performed (or “chosen”) simultaneously by 
the nonfaulty processors, where for each action we are given conditions under which 
the action must be performed and conditions under which its performance is forbidden. 
In addition, the specification of such a problem also determines the possible operating 
environments, by specifying what inputs each processor may possibly receive and what 
types of processor failures are possible. 


Formally, a simultaneous action $a$ is an action having two associated conditions 
$\mathit{pro}(a)$ and $\mathit{con}(a)$, both facts about the operating environment. A 
simultaneous choice problem C is a problem determined by a set $\{a_1,\ldots,a_m\}$ of 
simultaneous actions and their associated conditions, together with a failure model M, 
and a set $I_i$ of complete input histories for each processor $p_i$. Intuitively, we 
will require that every run $\rho$ of a protocol implementing C satisfies the following 
conditions: 


(i) each nonfaulty processor performs at most one of the $a_i$'s, 


(ii) any $a_i$ performed by some nonfaulty$^3$ processor is performed simultaneously by 
all of them, 


(iii) $a_i$ is performed by all nonfaulty processors if $\rho$ satisfies $\mathit{pro}(a_i)$, and 


(iv) $a_i$ is not performed by any nonfaulty processor if $\rho$ satisfies $\mathit{con}(a_i)$. 


More formally, a protocol P and the simultaneous choice C determine a class of systems 
$\{\Sigma(n,t) : n > t+2\}$, where $\Sigma(n,t) = (n,t,P,M,I_1,\ldots,I_n)$. We say that 
P implements C if every run of every system in the class determined by P and C satisfies 
conditions (i)-(iv) above. A simultaneous choice problem is said to be implementable 
(or satisfiable) if there is a protocol that implements it. 
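
As an illustration only, conditions (i)-(iv) can be checked mechanically on a finite record of a single run. The sketch below is not part of the formal development: the record format (`performed`, `nonfaulty`) and the function name are hypothetical, and the sets `pro` and `con` are assumed to list the actions whose pro and con conditions hold in the run. The set of nonfaulty processors is assumed to be nonempty.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical run record: each processor maps to the single (action, time)
# pair it performed, or to None if it performed no action.
Performed = Dict[str, Optional[Tuple[str, int]]]


def satisfies_conditions(performed: Performed,
                         nonfaulty: Set[str],
                         pro: Set[str],
                         con: Set[str],
                         strict: bool = False) -> bool:
    """Check conditions (i)-(iv) for one run (nonfaulty must be nonempty)."""
    records = {performed.get(p) for p in nonfaulty}
    # (i) "at most one action" is built into the record format; the strict
    # variant additionally forbids a nonfaulty processor performing nothing.
    if strict and None in records:
        return False
    # (ii) all nonfaulty processors must agree on the same (action, time),
    # or all perform nothing.
    if len(records) != 1:
        return False
    chosen = next(iter(records))
    chosen_action = chosen[0] if chosen is not None else None
    # (iii) every action whose pro condition holds must be the one performed.
    if any(a != chosen_action for a in pro):
        return False
    # (iv) an action whose con condition holds must not be performed.
    if chosen_action in con:
        return False
    return True
```

Checking every run of every system in the class this way is of course not feasible in general; the sketch only makes the four conditions concrete for one run.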


In addition to simultaneous choice problems, we also consider the closely related 
class of strict simultaneous choice problems. Both classes are specified in essentially 
the same way, except that runs of a protocol implementing a strict simultaneous choice 
are required to satisfy the modified condition 


(i′) each nonfaulty processor performs exactly one of the $a_i$'s, 


together with conditions (ii)-(iv) above. All of the results in this paper hold for strict 
simultaneous choice problems as well as simultaneous choice problems, and henceforth 
we will typically mention only simultaneous choice problems explicitly. 


3 We have chosen the set $\mathcal{N}$ of nonfaulty processors as the set of processors required to perform actions 
simultaneously, but the notion of a simultaneous choice problem may be stated in terms of many other 
similar (indexical) sets of processors, including the set P of all processors, with the analysis in this 
section and the next one carrying through without change. 
Having formally defined simultaneous choice problems (and strict simultaneous 
choice problems), let us consider when the specification of such a problem disallows 
performing a simultaneous action $a_i$. Clearly, if $\mathit{con}(a_i)$ holds then 
performing $a_i$ is disallowed. In addition, since by condition (i) no more than one 
action may be performed by the nonfaulty processors in any given run, the condition 
$\mathit{pro}(a_j)$, for some $j \neq i$, requires $a_j$ to be performed, and hence also 
disallows $a_i$. It is easy to see that these are the only conditions under which 
performing $a_i$ is disallowed. This motivates the following definition: 


$$\mathit{enabled}(a_i) \;\stackrel{\mathrm{def}}{=}\; \neg\mathit{con}(a_i) \wedge \bigwedge_{j \neq i} \neg\mathit{pro}(a_j).$$

Our discussion above implies that the performance of an action $a_i$ is allowed by the 
problem specification iff the condition $\mathit{enabled}(a_i)$ is satisfied. Notice that 
it is possible for several of the conditions $\mathit{enabled}(a_i)$ to hold at once, in 
which case performance of any of the enabled actions is allowed by the problem 
specification. In addition, it is easy to see that the formulas 
$\mathit{con}(a_i) \supset \neg\mathit{enabled}(a_i)$ and 
$\mathit{pro}(a_j) \supset \neg\mathit{enabled}(a_i)$ ($j \neq i$) are valid in any 
system in which processors follow a protocol implementing a simultaneous choice. 
Finally, notice that because the conditions $\mathit{pro}(a_i)$ and $\mathit{con}(a_i)$ 
are facts about the operating environment, so is each condition $\mathit{enabled}(a_i)$. 


The definition of a simultaneous choice problem is fairly abstract. Many familiar 
problems requiring simultaneous action by a group of processors are instances of a 
simultaneous choice or strict simultaneous choice. In all known cases, the conditions 
$\mathit{pro}(a_i)$ and $\mathit{con}(a_i)$ are facts about the input and the existence 
of failures. (By the existence of failures we mean whether any failure whatsoever occurs 
during the run. Some problems allow the nonfaulty processors to display default behavior 
in the presence of failures; cf. [LF].) For example, the distributed firing squad problem 
is a simultaneous choice consisting of a single “firing” action $a$, with the condition 
$\mathit{pro}(a)$ being the receipt of a “start” signal by a nonfaulty processor, and the 
condition $\mathit{con}(a)$ being that no processor receives a “start” signal. The 
condition $\mathit{enabled}(a)$ is simply that some processor receives a “start” signal. 
Each set $I_i$ of possible inputs simply allows for a “start” message to be delivered to 
any processor at any time. The simultaneous Byzantine agreement problem (cf. [DM], [PSL]) 
is an example of a strict simultaneous choice. This problem consists of an action $a_0$ 
of “deciding 0” and an action $a_1$ of “deciding 1.” Each set $I_i$ of possible inputs 
consists of two possible inputs: one starting with initial value 0 and receiving no 
further external input during the run, and the other starting with initial value 1. The 
condition $\mathit{pro}(a_0)$ is that all initial values are 0, and the condition 
$\mathit{pro}(a_1)$ is that all initial values are 1. The conditions $\mathit{con}(a_0)$ 
and $\mathit{con}(a_1)$ are both taken to be false. Here the condition 
$\mathit{enabled}(a_0)$ is that some initial value is 0, and the condition 
$\mathit{enabled}(a_1)$ is that some initial value is 1. Since for most assignments of 
initial values both $\mathit{enabled}(a_0)$ and $\mathit{enabled}(a_1)$ hold, it is 
typically the case that deciding either 0 or 1 is acceptable. Simultaneous Byzantine 
agreement is a strict simultaneous choice, since the processors are required to decide 
either 0 or 1 in every run. Other related problems that may also be formulated as 
(strict) simultaneous choice problems include weak Byzantine agreement and the Byzantine 
Generals problem (cf. [F]). 
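
To make the definition of the enabled conditions concrete, the following sketch derives them from given pro and con conditions and instantiates them for simultaneous Byzantine agreement. The encoding of an operating environment as a dictionary with an `initial_values` entry, and all function and variable names, are assumptions made for illustration only.

```python
from typing import Any, Callable, Dict

Env = Dict[str, Any]          # hypothetical encoding of an operating environment
Cond = Callable[[Env], bool]  # a fact about the operating environment


def make_enabled(pro: Dict[str, Cond], con: Dict[str, Cond]) -> Dict[str, Cond]:
    """enabled(a) holds iff con(a) fails and pro(b) fails for every other action b."""
    def enabled_for(a: str) -> Cond:
        def enabled(env: Env) -> bool:
            return (not con[a](env)) and all(
                not pro[b](env) for b in pro if b != a)
        return enabled
    return {a: enabled_for(a) for a in pro}


# Simultaneous Byzantine agreement as described in the text:
# pro(a0) = all initial values are 0, pro(a1) = all are 1, both con's are false.
sba_pro = {
    "a0": lambda env: all(v == 0 for v in env["initial_values"]),
    "a1": lambda env: all(v == 1 for v in env["initial_values"]),
}
sba_con = {"a0": lambda env: False, "a1": lambda env: False}
sba_enabled = make_enabled(sba_pro, sba_con)
```

With mixed initial values both derived conditions hold, matching the remark that deciding either 0 or 1 is then acceptable; with unanimous initial values only the matching action remains enabled.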


Having formally defined the notion of a simultaneous action, we are now in a position 
to carefully state the relationship between simultaneous actions and common knowledge 
mentioned in the introduction: When a simultaneous action is performed, it is common 
knowledge that the action is being performed. The statement we actually prove is that 
when such an action is performed, it is common knowledge that the action is enabled. 


Lemma 4: Let $\rho$ be a run of a protocol implementing a simultaneous choice C. If 
the action $a_i$ of C is performed by a nonfaulty processor at time $\ell$ in $\rho$, then 
$(\rho,\ell) \models C_{\mathcal{N}}\mathit{enabled}(a_i)$. 


Proof: Let $\varphi$ be the fact “$a_i$ is being performed by a nonfaulty processor.” A 
processor $p_j$ performing the action $a_i$ clearly knows that it is performing $a_i$. 
This processor therefore also knows that if it is nonfaulty then $a_i$ is being performed 
by a nonfaulty processor. Since $\rho$ is a run of a protocol implementing C, the action 
$a_i$ is performed simultaneously by all nonfaulty processors whenever it is performed by 
a single nonfaulty processor. It follows that whenever $\varphi$ holds, so does 
$E_{\mathcal{N}}\varphi$, and hence $\varphi \supset E_{\mathcal{N}}\varphi$ is valid in 
the system. The induction rule implies that $\varphi \supset C_{\mathcal{N}}\varphi$ is 
valid in the system as well. Notice that $\varphi \supset \mathit{enabled}(a_i)$ is valid 
in the system. It thus follows that 
$C_{\mathcal{N}}\varphi \supset C_{\mathcal{N}}\mathit{enabled}(a_i)$ is valid in the 
system, and hence so is $\varphi \supset C_{\mathcal{N}}\mathit{enabled}(a_i)$. Thus, 
$(\rho,\ell) \models \varphi$ implies 
$(\rho,\ell) \models C_{\mathcal{N}}\mathit{enabled}(a_i)$, and we are done. $\Box$ 


In the above proof, the essential fact that $\varphi \supset E_{\mathcal{N}}\varphi$ is 
valid in the system depends crucially on our definition of $E_{\mathcal{N}}$. A processor 
$p$ performing $a_i$ knows that $a_i$ is being performed, but since a nonfaulty processor 
might not know that it is nonfaulty, $p$ might not know that $a_i$ is being performed by 
a nonfaulty processor. The processor $p$ does know, however, that if it ($p$ itself) is 
nonfaulty, then a nonfaulty processor is performing $a_i$. It is for this reason that we 
have been led to choose our definition of $E_{\mathcal{N}}$ as we have, as discussed in 
the previous section. 


5 Optimal Protocols 


In this section, we show how to extract a high-level optimal protocol for a simultaneous 
choice problem directly from its specification. We begin by considering a simple class 
of protocols that will serve as a building block in the design of such optimal protocols. 
Recall that a protocol is a function specifying the actions a processor should perform 
and the messages it should send as a function of n, t, and the processor’s view. Thus, 
we can think of a protocol as having two components: an action component and a 
message component. A protocol is said to be a full-information protocol (cf. [Ha], [FL], 
[PSL]) if its message component is: 


repeat every round 
send current view to all processors 
forever. 


Intuitively, since such a protocol requires that all processors send all of the information 
available to them in every round, one would expect this protocol to give each processor 
as much information about the operating environment as any protocol could. In par- 
ticular, the following result shows that if a processor cannot distinguish two operating 
environments during runs of a full-information protocol, then the processor can not 
distinguish these operating environments during runs of any other protocol. 


Lemma 5: Let $\rho$ and $\rho'$ be runs of a full-information protocol $\mathcal{F}$, and 
let $\xi$ and $\xi'$ be runs of an arbitrary protocol P corresponding to $\rho$ and 
$\rho'$, respectively. For all processors $q$ and times $\ell$, if 
$v(q,\rho,\ell) = v(q,\rho',\ell)$ then $v(q,\xi,\ell) = v(q,\xi',\ell)$. 


Proof: We proceed by induction on the time $\ell$. The case of $\ell = 0$ is immediate 
since $q$ must have the same initial state in both $\rho$ and $\rho'$, and hence also in 
$\xi$ and $\xi'$. Suppose $\ell > 0$ and the inductive hypothesis holds for all 
processors $p$ at time $\ell-1$. The view of $q$ at time $\ell$ is determined by its view 
at time $\ell-1$, the (external) input it receives during round $\ell$, and the messages 
it receives during round $\ell$. Since $q$ has the same view at time $\ell-1$ in $\rho$ 
and $\rho'$, by the inductive hypothesis, the same is true in $\xi$ and $\xi'$. Since $q$ 
receives the same input during round $\ell$ in $\rho$ and $\rho'$, the same is true in 
$\xi$ and $\xi'$. If $q$ does not receive a message from $p$ during round $\ell$ in 
$\rho$ and $\rho'$, then both operating environments determine that no message from $p$ 
to $q$ during round $\ell$ is delivered. Thus, $q$ does not receive a message from $p$ 
during round $\ell$ in either $\xi$ or $\xi'$. If $q$ does receive a message from $p$ 
during round $\ell$ in $\rho$ and $\rho'$, then both operating environments determine 
that any message from $p$ to $q$ during round $\ell$ is delivered. In this case, since 
$q$ must receive the same message from $p$ in both $\rho$ and $\rho'$, the view of $p$ 
must be the same at time $\ell-1$ in $\rho$ and $\rho'$. By the inductive hypothesis, 
$p$'s view at time $\ell-1$ must also be the same in $\xi$ and $\xi'$. Since P is a 
deterministic function of a processor's view, $q$ receives the same messages from $p$ 
during round $\ell$ in $\xi$ and $\xi'$. Thus, $q$ has the same view at time $\ell$ in 
$\xi$ and $\xi'$. $\Box$ 


Thus, roughly speaking, processors learn the most about the operating environment 
during runs of full-information protocols. The following corollary of Lemma 5 shows 
that facts about the operating environment become common knowledge during runs of 
such protocols at least as soon as they do during runs of any other protocol. This result 
captures in a precise sense a property of full-information protocols that is essential to 
our analysis. 


Corollary 6: Let $\varphi$ be a fact about the operating environment. Let $\rho$ and 
$\xi$ be corresponding runs of a full-information protocol $\mathcal{F}$ and an arbitrary 
protocol P, respectively. If $(\xi,\ell) \models C_{\mathcal{N}}\varphi$ then 
$(\rho,\ell) \models C_{\mathcal{N}}\varphi$. 


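Proof: Suppose that $(\xi,\ell) \models C_{\mathcal{N}}\varphi$, and suppose that 
$(\rho,\ell) \sim (\rho',\ell)$ for some run $\rho'$ of $\mathcal{F}$. Let $\xi'$ be the 
run of P corresponding to $\rho'$. Lemma 5 and a simple inductive argument on the 
distance between $(\rho,\ell)$ and $(\rho',\ell)$ in the similarity graph show that 
$(\rho,\ell) \sim (\rho',\ell)$ implies $(\xi,\ell) \sim (\xi',\ell)$. Since 
$(\xi,\ell) \models C_{\mathcal{N}}\varphi$, we have $(\xi',\ell) \models \varphi$. Since 
corresponding runs satisfy the same facts about the operating environment, 
$(\xi',\ell) \models \varphi$ implies $(\rho',\ell) \models \varphi$. It follows that 
$(\rho,\ell) \models C_{\mathcal{N}}\varphi$. $\Box$ 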


We are now in a position to describe how to construct optimal protocols for 
simultaneous choice problems. Recall that when a simultaneous action $a_i$ is performed, 
Lemma 4 implies that $\mathit{enabled}(a_i)$ must be common knowledge. Since 
$\mathit{enabled}(a_i)$ is a fact about the operating environment, Corollary 6 implies 
that $\mathit{enabled}(a_i)$ becomes common knowledge in runs of a full-information 
protocol as soon as it does in corresponding runs of any other protocol. Thus, given an 
effective test that the nonfaulty processors can use to determine whether 
$\mathit{enabled}(a_i)$ is common knowledge, a test we denote by 
test-for-$C_{\mathcal{N}}\mathit{enabled}(a_i)$, the following protocol $\mathcal{F}_C$ 
is an optimal protocol for C: 


no_action_performed ← true; 
repeat every round 
    if no_action_performed and test-for-$C_{\mathcal{N}}\mathit{enabled}(a_i)$ returns true for some $a_i$ 
        then 
            j ← min {i : test-for-$C_{\mathcal{N}}\mathit{enabled}(a_i)$ returns true}; 
            perform $a_j$; 
            no_action_performed ← false; 
    send current view to every processor; 
forever. 
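
The action component of the protocol above can be sketched as a small state machine. This is only an illustration: the test for common knowledge is passed in as an opaque oracle `test_ck(i, view)`, a hypothetical stand-in for the test for common knowledge of the i-th enabled condition, and the class and method names are likewise assumptions. Nothing here says how such a test is implemented.

```python
from typing import Any, Callable, Optional, Tuple


class OptimalProtocolSketch:
    """One processor's local state for the high-level protocol (illustrative)."""

    def __init__(self, num_actions: int,
                 test_ck: Callable[[int, Any], bool]) -> None:
        self.num_actions = num_actions
        self.test_ck = test_ck            # hypothetical common-knowledge oracle
        self.no_action_performed = True

    def step(self, view: Any) -> Tuple[Optional[int], Any]:
        """Run one round; return (index of action to perform or None, message)."""
        action = None
        if self.no_action_performed:
            passing = [i for i in range(self.num_actions)
                       if self.test_ck(i, view)]
            if passing:
                action = min(passing)     # action of least index
                self.no_action_performed = False
        # Full-information message component: broadcast the current view.
        return action, view
```

Choosing the least index is what lets processors with equal sets of passing tests pick the same action without further communication.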


Before formally proving that $\mathcal{F}_C$ is an optimal protocol, we must define the 
tests for common knowledge that appear in $\mathcal{F}_C$ more formally. Recall that the 
fixpoint axiom implies 
$C_{\mathcal{N}}\varphi \supset E_{\mathcal{N}}C_{\mathcal{N}}\varphi$ is valid. This 
guarantees that $C_{\mathcal{N}}\varphi$ follows from the view of each nonfaulty 
processor whenever $C_{\mathcal{N}}\varphi$ holds. Notice that $C_{\mathcal{N}}\varphi$ 
is not guaranteed to follow from the view of a faulty processor. It is therefore natural 
to define a test for common knowledge of $\varphi$, denoted as above by 
test-for-$C_{\mathcal{N}}\varphi$, to be a test that, given the view of a nonfaulty 
processor at $(\rho,\ell)$ (together with n and t), returns true iff 
$C_{\mathcal{N}}\varphi$ holds at $(\rho,\ell)$. Such a test may return either true or 
false when given the view of a faulty processor. Let us denote by $A_i(\rho,\ell)$ the 
set of actions $a_j$ such that test-for-$C_{\mathcal{N}}\mathit{enabled}(a_j)$ returns 
true when given the view of $p_i$ at $(\rho,\ell)$. Notice that if $p_i$ is nonfaulty, 
then $A_i(\rho,\ell)$ is precisely the set of actions $a_j$ such that 
$C_{\mathcal{N}}\mathit{enabled}(a_j)$ holds at $(\rho,\ell)$. It follows that for all 
nonfaulty processors $p_i$ the sets $A_i$ are equal at all times. In particular, all 
become nonempty at the same time (as soon as $\mathit{enabled}(a_j)$ becomes common 
knowledge for some $a_j$). Thus, if all processors $p_i$ choose the action of least index 
from $A_i$ as soon as this set becomes nonempty, as required by $\mathcal{F}_C$, then all 
nonfaulty processors choose the same action simultaneously. We can now prove that 
$\mathcal{F}_C$ is an optimal protocol for C. 


Theorem 7: If C is an implementable simultaneous choice problem, then $\mathcal{F}_C$ is 
an optimal protocol for C. 


Proof: We first prove that nonfaulty processors perform actions in runs of 
$\mathcal{F}_C$ as soon as they do in corresponding runs of any protocol implementing C. 
Let $\rho$ be a run of $\mathcal{F}_C$, and let $\xi$ be the corresponding run of a 
protocol implementing C. Lemma 4 implies that if $a_j$ is performed by a nonfaulty 
processor at time $\ell$ in $\xi$, then 
$(\xi,\ell) \models C_{\mathcal{N}}\mathit{enabled}(a_j)$. Since $\mathit{enabled}(a_j)$ 
is a fact about the operating environment, Corollary 6 implies that 
$(\rho,\ell) \models C_{\mathcal{N}}\mathit{enabled}(a_j)$. As a result, $A_i(\rho,\ell)$ 
must be nonempty for all nonfaulty processors $p_i$, and hence each must perform an 
action in $\rho$ no later than time $\ell$. It follows that nonfaulty processors perform 
actions in runs of $\mathcal{F}_C$ as soon as they do in corresponding runs of any 
protocol implementing C. We now show that $\mathcal{F}_C$ actually implements C. 
Let $\rho$ be a run of $\mathcal{F}_C$. First, it is obvious from the definition of 
$\mathcal{F}_C$ that each nonfaulty processor performs at most one action in $\rho$. (If 
C is an implementable strict simultaneous choice, then the preceding discussion shows 
that the nonfaulty processors perform exactly one action in $\rho$.) Second, if a 
nonfaulty processor $p_i$ performs an action $a_j$ at time $\ell$ during $\rho$, then 
$\ell$ is the first time $k$ at which $A_i(\rho,k)$ is nonempty, and $a_j$ is the action 
of least index in this set. Since $A_i(\rho,k) = A_m(\rho,k)$ for all nonfaulty 
processors $p_m$, the same is true for all nonfaulty processors. As a result, all 
nonfaulty processors must choose to perform $a_j$ simultaneously at time $\ell$. Third, 
if $\rho$ satisfies $\mathit{pro}(a_j)$, then the run $\xi$ of any protocol implementing 
C corresponding to $\rho$ must satisfy $\mathit{pro}(a_j)$, and hence $a_j$ must be 
performed in $\xi$. As we have already seen, an action must also be performed in $\rho$. 
Since $\mathit{pro}(a_j) \supset \neg\mathit{enabled}(a_{j'})$ for all $j' \neq j$, the 
set $A_i(\rho,k)$ of a nonfaulty processor $p_i$ must contain no action other than $a_j$ 
(if it contains any action at all). Thus, $a_j$ must be the action performed in $\rho$. 
Finally, if $\rho$ satisfies $\mathit{con}(a_j)$, then $\rho$ does not satisfy 
$\mathit{enabled}(a_j)$, and no set $A_i(\rho,\ell)$ of a nonfaulty processor $p_i$ 
contains $a_j$. Thus, $a_j$ is not performed in $\rho$. It follows that $\mathcal{F}_C$ 
implements C. $\Box$ 
Figure 1: Communication graphs. 


As a result of Theorem 7, we see that full-information protocols can be used as the 
basis of optimal protocols for simultaneous choice problems. Thus, we will restrict our 
attention to full-information protocols in the remainder of this paper: unless otherwise 
specified, all protocols mentioned will be full-information protocols, and all runs will 
be runs of such protocols. More importantly, however, a consequence of Theorem 7 
is that designing an optimal protocol for a simultaneous choice problem C essentially 
reduces to testing for common knowledge of certain facts: In order to design an optimal 
protocol for C, it is enough to construct the tests for common knowledge of the facts 
$\mathit{enabled}(a_i)$. We note that the fundamental property of common knowledge 
underlying the existence of such tests is the fact that 
$C_{\mathcal{N}}\varphi \supset E_{\mathcal{N}}C_{\mathcal{N}}\varphi$ is valid; that is, 
when $\varphi$ becomes common knowledge, the fact that $\varphi$ is common knowledge will 
follow from the view of every nonfaulty processor. The problem of implementing such tests 
is the subject of 
the following section. 


Before ending this section, however, we consider the size of messages required by 
a full-information protocol $\mathcal{F}$. Such a protocol requires processors to send their entire 
view during every round. Since, strictly speaking, a processor’s view may be exponen- 
tial in size, this protocol seems to require processors to send messages of exponential 
length. We now show, however, that there is a simple, compact representation of a 
processor’s view that may be sent instead. Consequently, it will be possible to 
implement all full-information protocols (and in particular the optimal protocol $\mathcal{F}_C$) in a 
communication-efficient way in all variants of the omissions model. 


Given a run $\rho$ of $\mathcal{F}$, the communication graph of $\rho$ (see Figure 1; cf. 
[Me]) represents the messages delivered in $\rho$. It is a layered graph (with one layer 
corresponding to every natural number, representing time on the global clock) in which 
each processor is represented by one node in every layer. We denote the node representing 
processor $p$ at time $\ell$ by $(p,\ell)$. Edges connect nodes in adjacent layers, with 
an edge between $(p,k-1)$ and $(q,k)$ iff a message from $p$ is delivered to $q$ during 
round $k$. The labeled communication graph is obtained by labeling the layer 0 nodes of 
the communication graph with processors’ names and initial states, and by labeling the 
layer $k$ nodes (for $k > 0$) with requests the processors receive from external clients 
during round $k$. For every point $(\rho,\ell)$, we denote by $\mathcal{G}(\rho,\ell)$ 
the first $\ell+1$ layers of the labeled communication graph of $\rho$, representing the 
first $\ell$ rounds of $\rho$. For example, illustrated in Figure 1(a) is a graph 
$\mathcal{G}(\rho,3)$ depicting the first 3 rounds of a run $\rho$. We say that 
$\mathcal{G}(\rho,\ell)$ has length $\ell$. We note in passing that in runs of the 
full-information protocol $\mathcal{F}$, the labeled communication graph uniquely 
determines the operating environment. 


Informally, at every point $(\rho,\ell)$ of a run of $\mathcal{F}$, a processor $p_i$'s 
view corresponds to a certain subgraph $\mathcal{G}_i(\rho,\ell)$ of 
$\mathcal{G}(\rho,\ell)$. For example, the subgraph $\mathcal{G}_i(\rho,3)$ of 
$\mathcal{G}(\rho,3)$ is illustrated in Figure 1(b). We define the subgraph 
$\mathcal{G}_i(\rho,\ell)$ of $\mathcal{G}(\rho,\ell)$ inductively as follows. For 
$\ell = 0$ the subgraph $\mathcal{G}_i(\rho,0)$ consists of the labeled node $(p_i,0)$. 
For $\ell > 0$ the subgraph $\mathcal{G}_i(\rho,\ell)$ consists of the labeled node 
$(p_i,\ell)$, the edges from $(p_i,\ell)$ to layer $\ell-1$ nodes, and the subgraphs 
$\mathcal{G}_j(\rho,\ell-1)$ for every layer $\ell-1$ node $(p_j,\ell-1)$ adjacent to 
$(p_i,\ell)$. Given a set $S$ of processors, it is convenient to denote by 
$\mathcal{G}_S(\rho,\ell)$ the subgraph of $\mathcal{G}(\rho,\ell)$ consisting of the 
union of the graphs $\mathcal{G}_i(\rho,\ell)$ for every $p_i \in S$. We remark that 
$\mathcal{G}_S(\rho,\ell)$ uniquely determines $\mathcal{G}_i(\rho,\ell)$ for every 
$p_i \in S$. The next lemma states that a processor’s view of the labeled communication 
graph uniquely determines its view of the run. 


Lemma 8: Let $\rho$ and $\rho'$ be runs of a full-information protocol $\mathcal{F}$. For 
every processor $p_i$ and time $\ell$, $v(p_i,\rho,\ell) = v(p_i,\rho',\ell)$ iff 
$\mathcal{G}_i(\rho,\ell) = \mathcal{G}_i(\rho',\ell)$. 


Proof: We proceed by induction on $\ell$. The case of $\ell = 0$ is immediate. Suppose 
$\ell > 0$ and the inductive hypothesis holds for $\ell-1$. 


Suppose $p_i$ has the same view at time $\ell$ in both $\rho$ and $\rho'$. This implies, 
in particular, that $p_i$ has the same view at time $\ell-1$ in $\rho$ and $\rho'$, and 
from the inductive hypothesis it follows that 
$\mathcal{G}_i(\rho,\ell-1) = \mathcal{G}_i(\rho',\ell-1)$. In addition, this implies 
that $p_i$ must receive the same input during round $\ell$ in $\rho$ and $\rho'$, and 
hence $(p_i,\ell)$ is labeled with the same input in $\mathcal{G}_i(\rho,\ell)$ and 
$\mathcal{G}_i(\rho',\ell)$. If $p_i$ does not receive a message from a processor $p_j$ 
during round $\ell$ in $\rho$ and $\rho'$, then there is no edge from $(p_j,\ell-1)$ to 
$(p_i,\ell)$ in either $\mathcal{G}_i(\rho,\ell)$ or $\mathcal{G}_i(\rho',\ell)$. If 
$p_i$ does receive a message from a processor $p_j$ during round $\ell$ in $\rho$ and 
$\rho'$, then it receives the same message in both runs and $p_j$ must have the same view 
at time $\ell-1$ in both runs. Hence, there is an edge from $(p_j,\ell-1)$ to 
$(p_i,\ell)$ in both $\mathcal{G}_i(\rho,\ell)$ and $\mathcal{G}_i(\rho',\ell)$, and by 
the inductive hypothesis we have that 
$\mathcal{G}_j(\rho,\ell-1) = \mathcal{G}_j(\rho',\ell-1)$. Thus, 
$\mathcal{G}_i(\rho,\ell) = \mathcal{G}_i(\rho',\ell)$. 


Conversely, suppose $\mathcal{G}_i(\rho,\ell) = \mathcal{G}_i(\rho',\ell)$. It follows 
that $\mathcal{G}_i(\rho,\ell-1) = \mathcal{G}_i(\rho',\ell-1)$, and by the inductive 
hypothesis $p_i$ has the same view at time $\ell-1$ in $\rho$ and $\rho'$. The node 
$(p_i,\ell)$ must be labeled with the same input in $\mathcal{G}_i(\rho,\ell)$ and 
$\mathcal{G}_i(\rho',\ell)$, so $p_i$ receives the same input during round $\ell$ in 
$\rho$ and $\rho'$. The edges from layer $\ell-1$ nodes to $(p_i,\ell)$ are the same in 
$\mathcal{G}_i(\rho,\ell)$ and $\mathcal{G}_i(\rho',\ell)$, so $p_i$ receives messages 
from the same processors during round $\ell$ in $\rho$ and $\rho'$. Again, 
$\mathcal{G}_j(\rho,\ell-1) = \mathcal{G}_j(\rho',\ell-1)$ for every node $(p_j,\ell-1)$ 
adjacent to $(p_i,\ell)$, and by the inductive hypothesis $p_j$ has the same view at time 
$\ell-1$ in $\rho$ and $\rho'$. Hence, $p_i$ receives the same messages during round 
$\ell$ in $\rho$ and $\rho'$. It follows that $p_i$ has the same view at time $\ell$ in 
both $\rho$ and $\rho'$. $\Box$ 


Consequently, a processor’s view of the run and a processor’s view of the corresponding 
labeled communication graph convey the same information: Given either the graph 
$\mathcal{G}_i(\rho,\ell)$ or the view $v(p_i,\rho,\ell)$, reconstructing the other is 
straightforward. Therefore, an equivalent implementation of a full-information protocol 
requires the processors to send the labeled communication graphs corresponding to their 
views instead of sending their complete views. From now on, we will use the term 
full-information protocol to refer to this equivalent form. It is easy to see that the 
size of $\mathcal{G}_i(\rho,\ell)$ is polynomial in the number of processors n, the 
global time $\ell$, and the size of the requests received from external clients. It 
follows that messages required by a full-information protocol are of polynomial 
size.$^4$ Furthermore, given the labeled communication graphs corresponding to the views 
at time $\ell-1$ of the processors that send messages to a given processor $p_i$ during 
round $\ell$, it is easy to construct the labeled communication graph corresponding to 
the processor’s view at time $\ell$. Thus, the use of such compact representations of a 
processor’s view is computationally efficient as well as communication efficient. 
Finally, recall that we have formally defined a test for common knowledge to be a 
function of processor views (as well as n and t). In light of the preceding discussion, 
there is no loss of generality in assuming that such a test is a function of 
communication graphs corresponding to processor views. We now turn to the problem of 
implementing such tests. 
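
The inductive definition of a processor's view graph, and its round-by-round construction from received graphs, can be sketched as follows. The representation is an assumption made for illustration: a labeled communication graph is a pair consisting of a dictionary from nodes `(processor, layer)` to labels and a set of edges `((sender, layer - 1), (receiver, layer))`. The function names are hypothetical.

```python
from typing import Any, Dict, Set, Tuple

Node = Tuple[str, int]                 # (processor name, layer)
Edge = Tuple[Node, Node]               # (node in layer k-1, node in layer k)
Graph = Tuple[Dict[Node, Any], Set[Edge]]


def view_subgraph(labels: Dict[Node, Any], edges: Set[Edge],
                  proc: str, layer: int) -> Graph:
    """Extract the view graph of `proc` at `layer`: everything that reaches it."""
    nodes: Set[Node] = set()
    sub_edges: Set[Edge] = set()
    stack = [(proc, layer)]
    while stack:
        node = stack.pop()
        if node in nodes:
            continue
        nodes.add(node)
        for src, dst in edges:
            if dst == node:            # message delivered into this node
                sub_edges.add((src, dst))
                stack.append(src)      # recurse into the sender's earlier view
    return {n: labels[n] for n in nodes}, sub_edges


def extend_view(received: Dict[str, Graph], proc: str,
                layer: int, label: Any) -> Graph:
    """Build the view graph at `layer` from the graphs received this round.

    `received` maps each sender whose message was delivered to that sender's
    view graph at layer - 1.
    """
    labels: Dict[Node, Any] = {(proc, layer): label}
    edges: Set[Edge] = set()
    for sender, (s_labels, s_edges) in received.items():
        labels.update(s_labels)
        edges |= s_edges
        edges.add(((sender, layer - 1), (proc, layer)))
    return labels, edges
```

In this representation a graph of length $\ell$ has at most $n(\ell+1)$ nodes and at most $n^2\ell$ edges, which is one way to see the polynomial bound on message size when labels are of bounded size.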


6 Testing Common Knowledge 


The previous section established the claim that tests for common knowledge provide 
a very powerful programming technique: The design of optimal protocols for simulta- 
neous choice problems reduces to implementing tests for common knowledge of certain 
facts. In this section we investigate the problem of implementing tests for common 
knowledge in the different variants of the omissions model. With such tests, we will be 
able to construct optimal protocols for simultaneous choice problems in these models. 
As we will see, properties of the different variants of the omissions model cause dra- 
matic differences in the complexity of testing for common knowledge. In addition, the 


4 In the Byzantine failure models in which processors are allowed to lie (or maliciously deviate from 
the protocol), however, such compact representations are not guaranteed to exist; cf. [C]. 
optimal protocols we construct will have interesting properties that vary according to 
the failure model. 


Recall that a protocol is a function that, given the number of processors n, the 
bound t on the number of faulty processors, and a processor’s view, yields a list of the 
actions the processor should perform, as well as the messages it should send in the next 
round. (Thus, the protocols we are interested in are uniform in n and t.) Since the 
protocols we will be concerned with are full-information protocols, processors’ views 
will be efficiently representable by labeled communication graphs. We will soon restrict 
our attention to simultaneous choice problems in which the external requests are of 
constant size. This restriction implies that processors’ views at time $\ell$ will be of 
size polynomial in n and $\ell$. A protocol will therefore determine the messages and 
actions required at time $\ell$ based on input of size polynomial in n and $\ell$. 
Consequently, we will measure the complexity of computations performed by protocols at 
time $\ell$ in systems of n processors as a function of n and $\ell$: By polynomial time, 
polynomial space, etc., we will mean polynomial in n and $\ell$. 


The definition of simultaneous choice problems presented in Section 4 is very general. 
So general, in fact, that it is possible to define simultaneous choice problems with a 
variety of anomalous properties. For example, it is possible to define a simultaneous 
choice problem in which $\mathit{pro}(a)$ is the fact $\varphi$ = “the first round in 
which $p$ receives an external request is a round whose number is the index of a halting 
Turing machine” (in some a priori well-defined enumeration of Turing machines), and 
$\mathit{con}(a)$ is $\neg\varphi$. Clearly, since it is undecidable whether $\varphi$ 
holds even given the view of $p$ after it receives its first request, it will also be 
undecidable which of $C_{\mathcal{N}}\varphi$ or $C_{\mathcal{N}}\neg\varphi$ holds when 
processor $p$’s view becomes common knowledge. It follows that this simultaneous choice 
problem cannot be effectively implemented by a computable protocol. Similarly, one can 
construct simultaneous choice problems in which evaluating the conditions is intractable, 
rather than undecidable as in the above example. It is also possible to introduce 
anomalies by defining the sets $I_i$ of external inputs in strange ways. Since we are not 
interested in problems involving such inherent anomalies, we will avoid them by making 
restrictions on the relevant facts and the inputs arising in the simultaneous choice 
problems we will consider in the sequel. 


We first define the class of practical facts, which will be used to restrict the 
conditions that specify a simultaneous choice problem. Roughly speaking, one essential 
property of a practical fact $\varphi$ is that it is easy to determine from a processor’s 
view whether a run satisfies $\varphi$. More formally, we denote by 
“$\mathcal{G}_S(\rho,\ell)$” the property of being a run $\rho'$ such that 
$\mathcal{G}_S(\rho,\ell) = \mathcal{G}_S(\rho',\ell)$. Consequently, if 
$\mathcal{G}_S(\rho,\ell) \supset \varphi$ is valid in a system, then every run $\rho'$ 
of the system satisfying $\mathcal{G}_S(\rho,\ell) = \mathcal{G}_S(\rho',\ell)$ must also 
satisfy $\varphi$. In this case, we say that $\mathcal{G}_S(\rho,\ell)$ determines 
$\varphi$. Notice, for example, that no finite labeled communication graph 
$\mathcal{G}_S(\rho,\ell)$ can determine that a run is failure-free. With this notion in 
mind, a fact $\varphi$ is said to be practical within a class of systems 
$\{\Sigma(n,t) : n > t+2\}$ if the following conditions hold: (i) $\varphi$ is a fact 
about the input and the existence of failures, and (ii) there is a polynomial-time 
algorithm to determine, given n, t, and a graph $\mathcal{G}_S(\rho,\ell)$ at a point of 
$\Sigma(n,t)$, whether $\mathcal{G}_S(\rho,\ell) \supset \varphi$ is valid in 
$\Sigma(n,t)$. The first condition is justified by the fact that we will be testing for 
common knowledge of the conditions $\mathit{enabled}(a_i)$ arising from natural 
simultaneous choice problems, since such conditions are typically conditions on the input 
and existence of failures. The second condition ensures that it is easy to test whether a 
labeled communication graph determines that the fact holds. 


We now consider a natural restriction on the sets $I_i$ of possible inputs. A class 
of systems is said to be practical if there are two fixed finite sets S and M of initial 
states and external requests, respectively, such that each $I_i$ in all systems of the 
class is the set of complete input histories whose initial state is in S, and in which 
the input received in every round is a subset of M. This condition ensures that the input 
sets are of a simple form. In particular, it implies that all $I_i$'s are identical, and 
that the input received by a processor during any given round is of constant size. 


Having defined the notions of practical facts and practical classes of systems, we say 
that a simultaneous choice C is practical if (i) the class of systems determined by a full- 
information protocol and C is practical, and (ii) each condition $\mathit{enabled}(a_i)$ is practical 
within this class of systems. Essentially all natural simultaneous choice problems are 
practical. In particular, all simultaneous choice problems in the literature are practical. 
Our analysis will hence be restricted to testing for common knowledge of practical facts 
and to designing and implementing optimal protocols for practical simultaneous choice 
problems. We remark, however, that our analysis will apply to a more general class of 
simultaneous choice problems, whose precise characterization is somewhat complicated. 


In Section 5 we programmed protocols for simultaneous choice problems in a high-level 
language in which processors’ actions depend on explicit tests for common knowledge. 
Recall that test-for-$C_{\mathcal{N}}\mathit{enabled}(a_i)$ is a test nonfaulty 
processors can use to determine whether $\mathit{enabled}(a_i)$ is common knowledge: 
Given the graph corresponding to the view of a nonfaulty processor at $(\rho,\ell)$, 
test-for-$C_{\mathcal{N}}\mathit{enabled}(a_i)$ returns true iff 
$(\rho,\ell) \models C_{\mathcal{N}}\mathit{enabled}(a_i)$. Theorem 7 implies that given 
such a test for each condition $\mathit{enabled}(a_i)$, the protocol $\mathcal{F}_C$ is 
an optimal protocol for C. Until this point, however, we have sidestepped the issue of 
whether such tests actually exist. With the next lemma we see that, for practical 
simultaneous choice problems, such tests can be implemented in polynomial space. 


Lemma 9: If C is a practical simultaneous choice problem, then for each action $a_i$
the test test-for-$C_{\mathcal{N}}\,\mathit{enabled}(a_i)$ can be implemented in polynomial space.


Proof: We must exhibit an algorithm test-for-$C_{\mathcal{N}}\,\mathit{enabled}(a_i)$ determining in
polynomial space whether $\mathit{enabled}(a_i)$ is common knowledge at $(\rho,\ell)$, given $n$, $t$, and the
graph $\mathcal{G}_j(\rho,\ell)$ corresponding to the view of a nonfaulty processor $p_j$ at $(\rho,\ell)$. We will
actually exhibit a nondeterministic algorithm $A_i$ determining whether $\mathit{enabled}(a_i)$ is
not common knowledge at $(\rho,\ell)$. Since NPSPACE = PSPACE and PSPACE is closed
under complementation (cf. [HU]), the existence of the algorithm $A_i$ implies the
existence of an algorithm test-for-$C_{\mathcal{N}}\,\mathit{enabled}(a_i)$. Let $\{\Sigma(n,t) : n \ge t+2\}$ be a class of
systems determined by a full-information protocol and the problem C. We claim that
such an algorithm $A_i$ need only guess a point $(\eta,\ell)$, and show both that $(\rho,\ell) \sim (\eta,\ell)$
and that $\mathcal{G}(\eta,\ell) \supset \mathit{enabled}(a_i)$ is not valid in $\Sigma(n,t)$. Since $\mathcal{G}(\eta,\ell) \supset \mathit{enabled}(a_i)$ is
not valid in the system, there must be a point $(\eta',\ell)$ such that $\mathcal{G}(\eta,\ell) = \mathcal{G}(\eta',\ell)$ and
$(\eta',\ell) \not\models \mathit{enabled}(a_i)$. Let $\zeta$ be a run with the input of either $\eta$ or $\eta'$ in which the only
processors failing are those recorded as failing in $\mathcal{G}(\eta,\ell) = \mathcal{G}(\eta',\ell)$. Any nonfaulty
processor in $\eta$ or $\eta'$ must also be nonfaulty in $\zeta$, and also have the same view at time $\ell$
in both runs. Consequently, $(\eta,\ell) \sim (\zeta,\ell)$ and $(\zeta,\ell) \sim (\eta',\ell)$. Therefore, $(\rho,\ell) \sim (\eta',\ell)$
and $(\eta',\ell) \not\models \mathit{enabled}(a_i)$, so $(\rho,\ell) \not\models C_{\mathcal{N}}\,\mathit{enabled}(a_i)$.


We now describe the algorithm $A_i$ in greater detail. Notice that since C is practical,
the input received by a processor at each round during a run of $\Sigma(n,t)$ is of constant size,
and hence it is possible to construct the labeled communication graph of any point of
$\Sigma(n,t)$ in polynomial space. The algorithm $A_i$ begins by constructing the graph $\mathcal{G}(\rho',\ell)$
of a run $\rho'$ by adding to the graph $\mathcal{G}_j(\rho,\ell)$ received as input all edges not recorded as
missing in $\mathcal{G}_j(\rho,\ell)$. Notice that since $p_j$ is nonfaulty in $\rho$, it is nonfaulty in $\rho'$ as well, and
hence that $(\rho,\ell) \sim (\rho',\ell)$. The algorithm $A_i$ then shows that $(\rho',\ell) \sim (\eta,\ell)$ (and hence
that $(\rho,\ell) \sim (\eta,\ell)$) in polynomial space by constructing one by one the graph $\mathcal{G}(\zeta_m,\ell)$
of each point $(\zeta_m,\ell)$ in a path from $(\rho',\ell)$ to $(\eta,\ell)$ in the similarity graph. For each pair
of points $(\zeta_{m-1},\ell)$ and $(\zeta_m,\ell)$, the algorithm shows that some nonfaulty processor $p_x$ has
the same view at both points by choosing $p_x$, exhibiting for each point an assignment of
faulty processors (consistent with their respective graphs) in which $p_x$ is nonfaulty, and
showing that $p_x$ has the same view at both points by verifying $\mathcal{G}_x(\zeta_{m-1},\ell) = \mathcal{G}_x(\zeta_m,\ell)$.
Finally, since $\mathit{enabled}(a_i)$ is a practical fact, $A_i$ can show in polynomial time (and hence
in polynomial space) that $\mathcal{G}(\eta,\ell) \supset \mathit{enabled}(a_i)$ is not valid in the system $\Sigma(n,t)$. □


We note that the proof of Lemma 9 actually shows that testing for common knowl- 
edge of any practical fact can be done in polynomial space. In fact, the proof shows 
that such tests have effective implementations even when the algorithm determining 
whether $\mathcal{G}(\rho,\ell) \supset \mathit{enabled}(a_i)$ is valid does not run in polynomial time (although the
problem must still be decidable). In this case, however, the test is guaranteed to run in 
polynomial space only if this computation can be performed using polynomial space. 
The most important consequence of Lemma 9, however, is that practical simultaneous 
choice problems have polynomial-space optimal protocols. 


Theorem 10: If C is an implementable practical simultaneous choice problem, then 
there is a polynomial-space, optimal protocol for C. 


With Theorem 10 we see that practical simultaneous choice problems do have effective
optimal protocols. In general, however, connected components in the similarity
graph may be of exponential size, and paths in such components may be of exponential 
length. It therefore follows that the polynomial-space protocol given by Theorem 10 
requires the processors to perform exponential-time computations between consecutive 
rounds of communication. The resulting protocol is therefore clearly not a reasonable 
protocol to use in practice. A crucial question at this point is whether there are efficient 
optimal protocols for simultaneous choice problems. Recall that we have already seen 
that optimal protocols can be implemented in a way that makes efficient use of commu- 
nication. The rest of the paper is devoted to investigating ways of implementing tests for 
common knowledge in variants of the omissions model in a computationally-efficient 
manner, and therefore of implementing efficient, optimal protocols for simultaneous 
choice problems in these models. 


6.1 The Omissions Model 


In this section we consider the problem of efficiently implementing tests for common 
knowledge in the omissions failure model. Recall from Theorem 2 that the connected 
component of a point in the similarity graph completely determines what facts are 
common knowledge at that point. We develop an efficient construction that crisply 
characterizes the connected component of a point in the similarity graph. With this 
construction we devise efficient tests for common knowledge, and hence efficient proto- 
cols for simultaneous choice problems that are optimal in all runs. 
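
To make the role of the connected component concrete, the following is a minimal brute-force sketch, not the efficient construction developed in this section. It assumes a toy, explicitly enumerated system: `points` maps a hypothetical point name to a dictionary from each processor that is nonfaulty at that point to its view, and `fact` is a predicate on point names. Two points are treated as neighbours when some processor that is nonfaulty at both has the same view at both, and a fact counts as common knowledge at a point when it holds throughout that point's connected component. All names here are illustrative assumptions.

```python
from collections import deque


def similar(views_a, views_b):
    """True if some processor nonfaulty at both points has the same view at both."""
    return any(proc in views_b and views_b[proc] == view
               for proc, view in views_a.items())


def is_common_knowledge(points, start, fact):
    """Breadth-first search of the connected component of `start`.

    points: dict mapping a point name to {nonfaulty processor: view}.
    fact:   predicate on point names.
    Returns True iff `fact` holds at every point reachable from `start`.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        name = queue.popleft()
        if not fact(name):
            return False          # a reachable point falsifies the fact
        for other in points:
            if other not in seen and similar(points[name], points[other]):
                seen.add(other)
                queue.append(other)
    return True


# Toy system: a ~ b (processor 1 agrees), b ~ c (processor 2 agrees), d is isolated.
toy_points = {
    "a": {1: "x", 2: "y"},
    "b": {1: "x", 2: "z"},
    "c": {2: "z", 3: "w"},
    "d": {1: "u", 2: "v"},
}
```

Because the search enumerates every point of the component, it takes time proportional to the component's size, which may be exponential; this is the cost that the construction below is designed to avoid.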


The construction itself is motivated by a careful analysis of what facts do not be- 
come common knowledge during runs of a full-information protocol. (Unless otherwise 
mentioned, all protocols referred to in this section will be full-information protocols.) 
We begin with a technical result, similar to Lemma 15 of [DM]. In this and following 
results, it will be necessary to refer to runs whose operating environments differ slightly 
from each other. Therefore, given two runs of a protocol $\mathcal{P}$, we will say that a run $\rho$
differs from a run $\rho'$ only in a certain aspect of the operating environment if $\rho$ is the
result of executing $\mathcal{P}$ in an operating environment that differs from that of $\rho'$ only in
the said aspect. Notice that the messages sent in $\rho$ may actually be quite different
from those sent in $\rho'$. We say that a processor is silent from time $k$ if it fails to send
all messages in rounds following time $k$.


Lemma 11: Let $\rho$ and $\rho'$ be runs differing only in the (faulty) behavior displayed by
processor $p$ after time $k$, and suppose no more than $f$ processors fail in either $\rho$ or $\rho'$.
If $\ell - k \le t + 1 - f$, then $(\rho,\ell) \sim (\rho',\ell)$.


Proof: If $k \ge \ell$ then $\mathcal{G}(\rho,\ell) = \mathcal{G}(\rho',\ell)$, and Lemma 8 implies that $(\rho,\ell) \sim (\rho',\ell)$.
Therefore, assume $k < \ell$. We proceed by induction on $j = \ell - k$.


Suppose $j = 1$ (that is, $k = \ell - 1$). Without loss of generality, we may assume
that $\rho$ and $\rho'$ actually differ in the faulty behavior of $p$, and hence that $p$ fails in one of
these runs. Notice that since $p$ already fails in one of these runs and yet no more than $t$
processors fail in either run, it is clear that at most $t$ processors fail in a run differing
from either run only in the faulty behavior of $p$. Now, since $t \le n - 2$ and since $\rho$
and $\rho'$ differ only in the behavior of $p$, there are two processors $q$ and $r$ (other than $p$)
that do not fail in either run. Let $\rho_r$ be the run differing from $\rho$ only in that $p$ sends
to $r$ during round $\ell$ of $\rho_r$ iff it does so in $\rho'$. Since $q$'s view at time $\ell$ is independent
of whether $p$ sends to $r$ during round $\ell$, we have $(\rho,\ell) \sim (\rho_r,\ell)$. Let $\rho_r'$ be the run
differing from $\rho_r$ in that $p$ sends to precisely the same processors during round $\ell$ in $\rho_r'$
and $\rho'$. Since $r$'s view at time $\ell$ is independent of whether $p$ sends to the remaining
processors during round $\ell$, we have $(\rho_r,\ell) \sim (\rho_r',\ell)$. Since $\mathcal{G}(\rho_r',\ell) = \mathcal{G}(\rho',\ell)$, it follows
that $(\rho_r',\ell) \sim (\rho',\ell)$. Thus, by the transitivity of "$\sim$", we have $(\rho,\ell) \sim (\rho',\ell)$.


Suppose $j > 1$ (that is, $k < \ell - 1$) and the inductive hypothesis holds for $j - 1$. Let $\rho_i$
be the run differing from $\rho$ only in that for each processor $q$ in $\{p_1,\ldots,p_i\}$ processor $p$
sends to $q$ during round $k + 1$ in $\rho_i$ iff it does so in $\rho'$. Notice that $\rho = \rho_0$ and $\rho' = \rho_n$.
We will show that $(\rho,\ell) \sim (\rho_i,\ell)$ for all $i \ge 0$. Since $\rho_n$ differs from $\rho'$ only in the
faulty behavior of $p$ after time $k + 1$, and since $\ell - (k+1) = j - 1$, it will follow by the
inductive hypothesis for $j - 1$ that $(\rho_n,\ell) \sim (\rho',\ell)$. Finally, by the transitivity of "$\sim$",
we will have $(\rho,\ell) \sim (\rho',\ell)$ as desired.


We now proceed by induction on $i$ to show that $(\rho,\ell) \sim (\rho_i,\ell)$ for all $i \ge 0$. The
case of $i = 0$ is trivial. Suppose $i > 0$ and the inductive hypothesis holds for $i - 1$; that
is, $(\rho,\ell) \sim (\rho_{i-1},\ell)$. Notice that $\rho_{i-1}$ and $\rho_i$ differ at most in whether $p$ sends a message
to $p_i$ during round $k + 1$. Let $\eta$ be the run differing from $\rho_{i-1}$ in that $p_i$ is silent from
time $k + 1$ in $\eta$. Suppose no more than $g$ processors fail in either $\rho_{i-1}$ or $\eta$. Notice that
$g \le f + 1$. Therefore, since $1 < \ell - k \le t + 1 - f$ we have $f < t$ and $g \le t$, so at most $t$
processors fail in $\eta$. Furthermore, $\ell - (k+1) \le t + 1 - (f+1) \le t + 1 - g$. Since,
in addition, $\rho_{i-1}$ and $\eta$ differ only in the faulty behavior of $p_i$ after time $k + 1$, the
inductive hypothesis for $j - 1$ implies $(\rho_{i-1},\ell) \sim (\eta,\ell)$. Now, since $p_i$ is silent from time
$k + 1$ in $\eta$, the view of a nonfaulty processor at $(\eta,\ell)$ is independent of whether $p$ sends
to $p_i$ during round $k + 1$, so $(\eta,\ell) \sim (\eta',\ell)$ where $\eta'$ differs from $\eta$ in that $p$ sends to $p_i$
during round $k + 1$ in $\eta'$ iff it does so in $\rho_i$. Again, the inductive hypothesis for $j - 1$
implies that $(\eta',\ell) \sim (\rho_i,\ell)$. By the transitivity of "$\sim$", it follows that $(\rho,\ell) \sim (\rho_i,\ell)$. □


While Lemma 11 is a technical lemma in the context of this work, it has a number of
interesting consequences in its own right. In particular, the $(t+1)$-round lower bound on
the number of rounds required for simultaneous Byzantine agreement is an immediate
corollary of this lemma. The resulting proof of this lower bound is perhaps the simplest
to appear in the literature (see [DM] for details). More importantly, however, with
Lemma 11 we can prove two corollaries that will enable us to characterize the connected
components of the similarity graph. Consider the runs $\rho_1$ and $\rho_2$ of Figure 2, where
we indicate only faulty behavior: solid lines indicate silence, and dashed lines indicate
sporadic faulty behavior. Notice that $f$ processors fail in $\rho_1$. In the following lemma
we show that $(\rho_1,\ell) \sim (\rho_2,\ell)$ where $\rho_2$ differs from $\rho_1$ only in that processors failing
in $\rho_1$ are silent in $\rho_2$ from time $k$, where $k = \ell - (t + 1 - f)$. This implies, for instance,
that the views at time $k$ of processors failing in $\rho_1$ are not common knowledge at time $\ell$,
since these processors are silent from time $k$ in $\rho_2$.


[Figure 2: Runs illustrating Lemma 12. Panels: the run $\rho_1$; the run $\rho_2$.]


Lemma 12: Let $\rho_1$ be a run in which $f$ processors fail. Let $\rho_2$ be the run differing
from $\rho_1$ only in that processors failing in $\rho_1$ are silent from time $k$ in $\rho_2$, where
$k = \ell - (t + 1 - f)$. Then $(\rho_1,\ell) \sim (\rho_2,\ell)$.


Proof: Let $q_1,\ldots,q_f$ be the faulty processors in $\rho_1$. Let $\eta_i$ be the run differing
from $\rho_1$ in that processors $q_1,\ldots,q_i$ are silent from time $k$ in $\eta_i$. Notice that $\rho_1 = \eta_0$
and $\rho_2 = \eta_f$. We proceed by induction on $i$ to show that $(\rho_1,\ell) \sim (\eta_i,\ell)$ for all $i$. The
case of $i = 0$ is trivial. Suppose $i > 0$ and the inductive hypothesis holds for $i - 1$; that
is, $(\rho_1,\ell) \sim (\eta_{i-1},\ell)$. Since $\eta_{i-1}$ and $\eta_i$ differ at most in the faulty behavior of $q_i$ after
time $k$, it follows by Lemma 11 that $(\eta_{i-1},\ell) \sim (\eta_i,\ell)$. By the transitivity of "$\sim$", we
have $(\rho_1,\ell) \sim (\eta_i,\ell)$. □
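
As a quick check of how Lemma 12 uses Lemma 11, consider illustrative values (chosen here only as an example, not taken from the text): $t = 3$, $f = 2$, and $\ell = 4$.

\[
k = \ell - (t + 1 - f) = 4 - (3 + 1 - 2) = 2,
\qquad
\ell - k = 2 \le t + 1 - f = 2 .
\]

The hypothesis $\ell - k \le t + 1 - f$ of Lemma 11 therefore holds with equality, and since each run $\eta_i$ has at most $f$ faulty processors, the lemma can be applied once for each of the $f = 2$ faulty processors.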


Before discussing the second lemma, we make an important definition. Given a
point $(\rho,k)$ and a set of processors $G$, let

\[
B(G,\rho,k) \stackrel{\mathrm{def}}{=} \{\, p : (\rho,k) \models I_G(\text{``$p$ is faulty''}) \,\}.
\]

By this definition, $B(G,\rho,k)$ is the set of processors implicitly known by $G$ at $(\rho,k)$
to be faulty. An important property of the omissions failure model is that processors
fail only by failing to send messages. It follows that $G$ implicitly knows at $(\rho,k)$ that a
processor $p$ is faulty iff $G$ implicitly knows at $(\rho,k)$ of some processor $q$ not receiving a
message from $p$ before time $k$; that is, $\mathcal{G}_G(\rho,k)$ contains no edge from $(p, j-1)$ to $(q, j)$
for some node $(q,j)$ of $\mathcal{G}_G(\rho,k)$. It is therefore simple and straightforward to compute
$B(G,\rho,k)$ given $\mathcal{G}_G(\rho,k)$.


[Figure 3: Runs illustrating Lemma 13. Panels: the run $\rho_2$; the run $\rho_3$.]
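
The check described above can be sketched in a few lines. This is only an illustration under an assumed encoding: the joint view is given as a set of nodes `(processor, time)` together with the set of delivered-message edges `((sender, time - 1), (receiver, time))` that it records, and `processors` is the full set $P$. The function name and this encoding are hypothetical.

```python
def known_faulty(processors, nodes, edges):
    """Return the processors with an omission visible in the given view graph.

    processors: iterable of all processor identifiers (the set P).
    nodes:      set of (processor, time) pairs present in the view graph.
    edges:      set of ((sender, time - 1), (receiver, time)) pairs, one for
                each message the view records as delivered.
    A processor p is reported if, for some node (q, j) with j >= 1 and q != p,
    the edge from (p, j - 1) to (q, j) is absent.
    """
    faulty = set()
    for (q, j) in nodes:
        if j < 1:
            continue              # no round ends at time 0, so nothing can be missing
        for p in processors:
            if p != q and ((p, j - 1), (q, j)) not in edges:
                faulty.add(p)
    return faulty


# Processor 0's view at time 1: it heard from processor 1 but not from processor 2.
example_nodes = {(0, 1), (0, 0), (1, 0)}
example_edges = {((1, 0), (0, 1))}
```

The scan visits each node and each potential sender once, so it is polynomial in the size of the view graph.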


The essence of the second lemma is captured by the runs $\rho_2$ and $\rho_3$ of Figure 3. In
the run $\rho_2$, the $f$ faulty processors are silent from time $k = \ell - (t + 1 - f)$. The set
$G$ is the set of nonfaulty processors and $B = B(G,\rho_2,k)$. The run $\rho_3$ differs from $\rho_2$
only in that processors in $P - B$ do not fail in $\rho_3$. The following lemma states that
$(\rho_2,\ell) \sim (\rho_3,\ell)$. This implies, for instance, that the failure of processors in $P - B$
cannot be common knowledge at $(\rho_2,\ell)$, since they do not fail in $\rho_3$. Formally, we have (see
Figure 3):


Lemma 13: Let $\rho_2$ be a run in which the $f$ faulty processors are silent from time
$k = \ell - (t + 1 - f)$. Let $G$ be the set of nonfaulty processors in $\rho_2$, and let $B = B(G,\rho_2,k)$.
Let $\rho_3$ be the run differing from $\rho_2$ only in that processors in $P - B$ do not fail. Then
$(\rho_2,\ell) \sim (\rho_3,\ell)$.


Proof: If a processor $p$ in $P - B$ fails to a processor $q$ during some round $j \le k$
of $\rho_2$, then the node $(q,j)$ must not be a node of $\mathcal{G}_G(\rho_2,k)$ or the failure of $p$ would be
implicitly known by $G$ at time $k$ and $p$ would be in $B$, a contradiction. Thus, $\mathcal{G}_G(\rho_2,k)$
is independent of whether $\mathcal{G}(\rho_2,k)$ contains an edge from $p$ to $q$ during round $j$. Let $\rho_2'$
be a run differing from $\rho_2$ only in that no processor in $P - B$ fails before time $k$ in $\rho_2'$.
By the previous discussion, $\mathcal{G}_G(\rho_2,k) = \mathcal{G}_G(\rho_2',k)$. In both $\rho_2$ and $\rho_2'$, every processor
in $G$ successfully sends every message after time $k$ and every processor in $P - G$
is silent from time $k$. Since, in addition, every processor in $G$ receives the same input
after time $k$ in $\rho_2$ and $\rho_2'$, we have $\mathcal{G}_G(\rho_2,\ell) = \mathcal{G}_G(\rho_2',\ell)$. Given that $G$ is the set of
nonfaulty processors in $\rho_2$, each of which is also nonfaulty in $\rho_2'$, it follows by Lemma 8
that $(\rho_2,\ell) \sim (\rho_2',\ell)$. Since the runs $\rho_2'$ and $\rho_3$ differ only in the faulty behavior of
processors in $P - B$ after time $k$, by repeated application of Lemma 11 it follows that
$(\rho_2',\ell) \sim (\rho_3,\ell)$. Hence, $(\rho_2,\ell) \sim (\rho_3,\ell)$. □


[Figure 4: An example of the construction when $t = 9$.]


Having seen Lemmas 12 and 13, let us consider how these results suggest a
characterization of the similarity graph, and hence of what facts are common knowledge at
a given point. Going back to Figures 2 and 3, notice that if $f' < f$ (where $f'$ is the
number of processors in $B$), then by setting $\rho_1' = \rho_3$ we can apply Lemmas 12 and 13
again. Iterating this process, we reach a run $\hat\rho$ satisfying $(\rho_1,\ell) \sim (\hat\rho,\ell)$ where the $\hat f$
processors failing in $\hat\rho$ are silent from time $\hat k = \ell - (t + 1 - \hat f)$, and where all faulty
processors are implicitly known to be faulty by the nonfaulty processors at $(\hat\rho,\hat k)$.
Notice that the run $\hat\rho$ is a fixpoint of this iterative process (that is, applying Lemmas 12
and 13 to $\hat\rho$ yields $\hat\rho$ itself). We claim, in addition, that the joint view of the nonfaulty
processors at $(\hat\rho,\hat k)$ characterizes the connected component of $(\rho_1,\ell)$ in the similarity
graph, and hence what facts are common knowledge at $(\rho_1,\ell)$. In order to make this
claim precise, we now formalize a local version of this iterative process, illustrated in
Figure 4, that processors can use to construct locally this joint view.


Let $p$ be an arbitrary processor. We define $G_0 = \{p\}$ and $k_0 = \ell$, and we define
$G_{i+1}$ and $k_{i+1}$ inductively. Denoting $B(G_i,\rho,k_i)$ by $B_i$, let

\[
G_{i+1} = P - B_i, \qquad k_{i+1} = \ell - (t + 1 - |B_i|).
\]

Notice that when $k_{i+1} < 0$, the view at time $k_{i+1}$ of every processor in $G_{i+1}$ is the
distinguished empty view, and hence $B_{i+1}$ must be empty. As a consequence, for all
$j > i + 1$, we have that $G_j = P$, $k_j = \ell - (t + 1)$, and $B_j$ is empty. This construction
determines three (infinite) sequences $\{G_i\}$, $\{k_i\}$, and $\{B_i\}$. In the next few pages we will
see that these sequences have limits $G$, $k$, and $B$, and that these limits are independent
of the processor with which the construction is begun. As a result, individual processors
will be able to construct these values based solely on their local view. We will see
that the joint view of $G$ at time $k$ completely characterizes the connected component
of $(\rho,\ell)$ in the similarity graph, and hence what facts are common knowledge at $(\rho,\ell)$.
This construction will therefore provide an efficient way of determining what facts are
common knowledge at a given point.
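
The iteration can be sketched as follows. This is a minimal illustration rather than the paper's implementation, and it makes several assumptions: it works from a global failure pattern instead of a processor's local view, the pattern is encoded as a set `missing` of triples `(sender, receiver, round)` naming undelivered messages, round $j$ runs from time $j-1$ to time $j$, every processor retains its own earlier state, and no failure is visible at any time $k \le 0$. The names `known_faulty` and `construct` are hypothetical.

```python
def known_faulty(procs, group, k, missing):
    """B(G, rho, k): senders with an omission visible to `group` at time k."""
    if k <= 0:
        return frozenset()            # nothing has been sent, so nothing is known
    heard = set(group)                # processors whose time-j node is in the joint view
    faulty = set()
    for j in range(k, 0, -1):         # walk rounds k, k-1, ..., 1 backwards
        faulty |= {p for p in procs for q in heard
                   if p != q and (p, q, j) in missing}
        # (p, j-1) is in the view if p delivered a round-j message to some
        # processor already in the view, or if p itself is in the view.
        heard |= {p for p in procs for q in heard
                  if (p, q, j) not in missing}
    return frozenset(faulty)


def construct(procs, t, ell, missing, start):
    """Iterate G_{i+1} = P - B_i, k_{i+1} = ell - (t + 1 - |B_i|) until B_i stabilizes."""
    group, k = frozenset({start}), ell            # G_0 and k_0
    bad = known_faulty(procs, group, k, missing)  # B_0
    for _ in range(t + 2):
        group = frozenset(procs) - bad
        k = ell - (t + 1 - len(bad))
        nxt = known_faulty(procs, group, k, missing)
        if nxt == bad:
            break
        bad = nxt
    return group, k, bad


# Five processors, t = 2; processors 2 and 3 send nothing in rounds 1 and 2.
silent = {(p, q, r) for p in (2, 3) for q in range(5) for r in (1, 2) if p != q}
```

With no failures, `construct(range(4), 2, 3, set(), 0)` yields $G = P$ and $k = 0$, whereas with the `silent` pattern above and $\ell = 2$ it yields $G = \{0, 1, 4\}$ and $k = 1$, regardless of the starting processor.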


Among other things, this construction captures a number of essential aspects of the 
information flow during the run up to time @. In particular, one important property of 
this construction is the following: 


Lemma 14: Every processor in $G_{i+1}$ successfully sends to every processor in $G_i$ in
every round before time $k_i$.


Proof: Suppose some processor $q$ of $G_{i+1}$ fails to send to a processor $q'$ of $G_i$ during
a round before time $k_i$. Then $q$'s failure to $q'$ is implicitly known by $G_i$ at time $k_i$, so
$q \in B_i$ and $q \notin G_{i+1}$, a contradiction. □


One consequence of Lemma 14 is that the view of the processor $p$ at time $\ell$ must
contain the view of every processor in $G_i$ at time $k_i$ for every $i \ge 0$. Thus, an essential
property of the construction is that it depends only on the view of processor $p$ at $(\rho,\ell)$,
and hence that p is able to compute these sets locally. A second essential property of 
the construction is that it converges within t+ 1 iterations, as we see with the following 
result. 


Lemma 15:

\[
\lim_{i\to\infty} G_i = G_{t+1}
\quad\text{and}\quad
\lim_{i\to\infty} k_i = k_{t+1}.
\]


Proof: We will show that $B_{i+1} \subseteq B_i$ for all $i \ge 0$. Since $B_0$ contains at most $t$
processors, it will then follow that there must be an $i \le t$ for which $B_i = B_{i+1}$. From
the definition of the construction, it is easy to see that we will have $B_i = B_{i+j}$ for all
$j \ge 0$. In addition, we will have $G_{i+1} = G_{i+1+j}$ and $k_{i+1} = k_{i+1+j}$ for all $j \ge 0$, and
we will be done. We proceed by induction on $i$. If $k_{i+1} < 0$, then $B_{i+1}$ is empty and
$B_{i+1} \subseteq B_i$, so let us assume $k_{i+1} \ge 0$. Suppose $i = 0$. By Lemma 14, every processor
in $G_1$ must send to every processor in $G_0$ during round $k_1 + 1$. It follows that any
failure implicitly known by $G_1$ at time $k_1$ must be implicitly known by $G_0$ at time $k_0$.
Thus, $B_1 \subseteq B_0$. Suppose $i > 0$ and the inductive hypothesis holds for $i - 1$; that is,
$B_i \subseteq B_{i-1}$. If $B_i = B_{i-1}$, then $B_{i+1} = B_i$. If $B_i \subset B_{i-1}$, then $k_{i+1} < k_i$. By Lemma 14,
$G_{i+1}$ sends to $G_i$ during round $k_{i+1} + 1$, so $B_{i+1} \subseteq B_i$. □


We denote the results of the construction (the limits of the sequences $\{G_i\}$, $\{k_i\}$,
and $\{B_i\}$) by $G$, $k$, and $B$. We denote these values by $G(p,\rho,\ell)$, $k(p,\rho,\ell)$, and $B(p,\rho,\ell)$
when the processor $p$ and the point $(\rho,\ell)$ are not clear from context. We now show,
however, that these values are independent of processor $p$.


Lemma 16: For all processors $p$ and $q$, $G(p,\rho,\ell) = G(q,\rho,\ell)$ and $k(p,\rho,\ell) = k(q,\rho,\ell)$.


Proof: We prove the claim by showing that $B(p,\rho,\ell) = B(q,\rho,\ell)$. Given that
$B_i$ uniquely determines $G_{i+1}$ and $k_{i+1}$, this will imply the desired result. It suffices
to show that $B(p,\rho,\ell) \subseteq B(q,\rho,\ell)$, since the other direction will follow by symmetry.
Denote the intermediate results of the construction from the point $(\rho,\ell)$ starting with
the processor $p$ by $G_i$, $k_i$, and $B_i$, and the final results by $G$, $k$, and $B$. Similarly,
denote the intermediate results of the construction starting with $q$ by $G_i'$, $k_i'$, and $B_i'$,
and the final results by $G'$, $k'$, and $B'$. We now show that $B \subseteq B'$. If $k < 0$, then
$B$ is empty and $B \subseteq B'$, so assume $k \ge 0$. We consider two cases.

First, suppose $k = \ell - 1$. In this case, $B$ must contain $t$ faulty processors since
$k = \ell - (t + 1 - |B|)$. It follows that every processor in $G$ must be nonfaulty and hence
must send to $G_0'$ during round $k + 1$, so $B \subseteq B_0'$. Since, in addition, $|B_0'| \le t$ and
$|B| = t$, we have $B = B_0'$. It follows from the construction that $B = B_i'$ for every
$i \ge 0$, and hence that $B = B'$.

Now, suppose $k < \ell - 1$. Let $r$ be an (arbitrary) nonfaulty processor in $\rho$. We claim
that every processor $g$ in $G$ must send its view to $r$ during round $k + 1$. Suppose some
processor $g$ in $G$ does not. Let $j$ be the least integer such that $G = G_j$. If $j = 1$,
then $r$ must send to $G_0$ during round $k + 2$. If $j > 1$, then $r$ must actually be a member
of $G_{j-1}$ since $G_{j-1}$ must contain all of the nonfaulty processors. In either case, the
failure of $g$ to $r$ during round $k + 1$ must be implicitly known by $G_{j-1}$ at time $k_{j-1}$,
so $g \in B_{j-1}$. Since $G = G_j = P - B_{j-1}$, we have $g \notin G$, a contradiction. Thus, every
processor in $G$ must send to $r$ during round $k + 1$. We now proceed by induction on $i$
to show that $B \subseteq B_i'$ for all $i \ge 0$. Suppose $i = 0$. Every processor in $G$ must send
to the nonfaulty processor $r$ during round $k + 1$, and $r$ must send to $G_0'$ during round
$k + 2$, so $B \subseteq B_0'$. Suppose $i > 0$ and the inductive hypothesis holds for $i - 1$; that is,
$B \subseteq B_{i-1}'$. If $B = B_{i-1}'$, then $B = B_i'$. If $B \subset B_{i-1}'$, then $k < k_i'$. Every processor in $G$
must send to the nonfaulty processor $r$ during round $k + 1$, and $r$ must be contained
in $G_i'$, so $B \subseteq B_i'$. It follows that $B \subseteq B_i'$ for all $i \ge 0$, and hence $B \subseteq B'$. □


As a result of Lemma 16, we see that $G$, $k$, and $B$ depend only on the point $(\rho,\ell)$,
and not the processor with which the construction begins. Thus, a third essential
property of this construction is that every processor (and not just, say, the nonfaulty
processors) is able to compute locally the values of $G$, $k$, and $B$. We will denote these
values by $G(\rho,\ell)$, $k(\rho,\ell)$, and $B(\rho,\ell)$ when $(\rho,\ell)$ is not clear from context. From the
definition of the construction it is clear that the driving force behind the construction
is the identity of the sets $B_i$. Notice that these sets are uniquely determined by the
failure pattern, and do not depend on the run's (external) input. Taking into account
the external input of a run, we are now in a position to show how the construction
characterizes the connected components in the similarity graph. Denoting $G(\rho,\ell)$ by $G$
and $k(\rho,\ell)$ by $k$, we define

\[
V(\rho,\ell) \stackrel{\mathrm{def}}{=} v(G,\rho,k).
\]

This definition says that $V(\rho,\ell)$ is the joint view of the processors in $G(\rho,\ell)$ at time
$k(\rho,\ell)$. Our next lemma implies that $V$ is the same at similar points, which implies
that the joint view $V(\rho,\ell)$ is common knowledge at $(\rho,\ell)$.


Lemma 17: If $(\rho,\ell) \sim (\rho',\ell)$ then $V(\rho,\ell) = V(\rho',\ell)$.


Proof: We proceed by induction on the distance $d$ between the points $(\rho,\ell)$ and
$(\rho',\ell)$. The case of $d = 0$ is trivial. Suppose that $d > 0$ and the inductive hypothesis
holds for $d - 1$. Since the distance between $(\rho',\ell)$ and $(\rho,\ell)$ is $d$, there must be a
point $(\eta,\ell)$ whose distance from $(\rho,\ell)$ is $d - 1$, and whose distance from $(\rho',\ell)$ is 1.
The inductive hypothesis implies that $V(\rho,\ell) = V(\eta,\ell)$, and we must have
$v(p,\eta,\ell) = v(p,\rho',\ell)$ for some processor $p$. As a consequence of Lemmas 14 and 16, the values
of $V(\eta,\ell)$ and $V(\rho',\ell)$ depend only on the view of $p$ at $(\eta,\ell)$ and $(\rho',\ell)$, respectively.
Since $p$ has the same view at $(\eta,\ell)$ and at $(\rho',\ell)$, we have $V(\eta,\ell) = V(\rho',\ell)$. Since
$V(\rho,\ell) = V(\eta,\ell)$, it follows that $V(\rho,\ell) = V(\rho',\ell)$. □


Conversely, we wish to show that all points with the same V are similar, and hence 
that V completely characterizes the connected components of the similarity graph. 
Before we do so, however, we formalize the reasoning with which Lemmas 12 and 13 
motivated consideration of the construction in the first place. 


Lemma 18: Let $\rho$ be a run, and let $G$, $k$, and $B$ be the results of the construction
from $(\rho,\ell)$. Let $\rho'$ be the run differing from $\rho$ only in that processors in $G$ do not fail
in $\rho'$ and processors in $B$ are silent from time $k$ in $\rho'$. Then $(\rho,\ell) \sim (\rho',\ell)$.


Proof: Let $G_i$, $k_i$, and $B_i$ be the intermediate results of the construction from $(\rho,\ell)$
starting with the nonfaulty processor $p_j$. For $i \ge 0$, define $\rho_i$ to be the run differing
from the run $\rho$ only in that processors in $B_i$ are silent from time $k_i$ in $\rho_i$ and the
remaining processors do not fail in $\rho_i$. Notice that $\rho = \rho_0$ and $\rho' = \rho_i$ for sufficiently
large $i$. We proceed by induction on $i$ to show that $(\rho,\ell) \sim (\rho_i,\ell)$ for all $i \ge 0$. Suppose
$i = 0$. Since the subgraph $\mathcal{G}_j(\rho,\ell)$ must be independent of whether the graph $\mathcal{G}(\rho,\ell)$
is missing an edge from a processor in $P - B_0$ to a processor other than $p_j$, we have
$\mathcal{G}_j(\rho,\ell) = \mathcal{G}_j(\rho_0,k_0)$. Since processor $p_j$ is nonfaulty, it follows that $(\rho,\ell) \sim (\rho_0,\ell)$.
Suppose $i > 0$ and the inductive hypothesis holds for $i - 1$; that is, $(\rho,\ell) \sim (\rho_{i-1},\ell)$.
Lemma 12 implies $(\rho_{i-1},\ell) \sim (\rho_{i-1}',\ell)$ where $\rho_{i-1}'$ differs from $\rho_{i-1}$ in that processors in
$B_{i-1}$ (the processors failing in $\rho_{i-1}$) are silent from time $k_i$ in $\rho_{i-1}'$. Lemma 13 implies
$(\rho_{i-1}',\ell) \sim (\rho_i,\ell)$. Thus, $(\rho,\ell) \sim (\rho_i,\ell)$. □


Finally, we have the following:


Lemma 19: If $V(\rho,\ell) = V(\rho',\ell)$ then $(\rho,\ell) \sim (\rho',\ell)$.


Proof: The fact $V(\rho,\ell) = V(\rho',\ell)$ implies $G(\rho,\ell) = G(\rho',\ell)$, $k(\rho,\ell) = k(\rho',\ell)$, and
$B(\rho,\ell) = B(\rho',\ell)$. We therefore denote these values by $G$, $k$, and $B$. Let $\zeta$ be a run
that differs from $\rho$ in that processors in $G$ do not fail in $\zeta$, and processors in $B$ are silent
from time $k$ in $\zeta$. Let $\zeta'$ be an analogous run with respect to $\rho'$. Lemma 18 implies
that $(\rho,\ell) \sim (\zeta,\ell)$ and $(\rho',\ell) \sim (\zeta',\ell)$. In order to show that $(\rho,\ell) \sim (\rho',\ell)$, it is enough
to show that $(\zeta,\ell) \sim (\zeta',\ell)$. Suppose $G = \{q_1,\ldots,q_s\}$, and let $\zeta_i$ be the run differing
from $\zeta$ in that $q_1,\ldots,q_i$ receive the same input after time $k$ in $\zeta_i$ as they do in $\zeta'$. We
proceed by induction on $i$ to show that $(\zeta,\ell) \sim (\zeta_i,\ell)$ for all $i \ge 0$. Since $\zeta = \zeta_0$, the
case of $i = 0$ is trivial. Suppose $i > 0$ and the inductive hypothesis holds for $i - 1$; that
is, $(\zeta,\ell) \sim (\zeta_{i-1},\ell)$. Let $\eta_{i-1}$ and $\eta_i$ be runs differing from $\zeta_{i-1}$ and $\zeta_i$, respectively, only
in that $q_i$ is silent from time $k$ in $\eta_{i-1}$ and $\eta_i$. Lemma 11 implies $(\zeta_{i-1},\ell) \sim (\eta_{i-1},\ell)$
and $(\zeta_i,\ell) \sim (\eta_i,\ell)$. In addition, since $\eta_{i-1}$ and $\eta_i$ differ only in the input received by $q_i$
after time $k$, and since $q_i$ is silent from time $k$ in both runs, we have $(\eta_{i-1},\ell) \sim (\eta_i,\ell)$.
Thus, $(\zeta,\ell) \sim (\zeta_i,\ell)$ for all $i \ge 0$. In particular, $(\zeta,\ell) \sim (\zeta_s,\ell)$. In order to complete
the proof, it now suffices to show that $(\zeta_s,\ell) \sim (\zeta',\ell)$. Since $\mathcal{G}_G(\rho,k) = \mathcal{G}_G(\rho',k)$,
$(\rho,\ell) \sim (\zeta,\ell)$ and $(\rho',\ell) \sim (\zeta',\ell)$, Lemma 17 implies that $\mathcal{G}_G(\zeta,k) = \mathcal{G}_G(\zeta',k)$. Notice
that $\mathcal{G}_G(\zeta_s,k) = \mathcal{G}_G(\zeta,k) = \mathcal{G}_G(\zeta',k)$. Notice, in addition, that processors in $G$ do not
fail in either $\zeta_s$ or $\zeta'$, and that the remaining processors (in $B$) are silent from time $k$
in both runs. Finally, notice that processors in $G$ receive the same input after time $k$
in both runs. It follows that $\mathcal{G}_G(\zeta_s,\ell) = \mathcal{G}_G(\zeta',\ell)$, and hence that $(\zeta_s,\ell) \sim (\zeta',\ell)$. Thus,
$(\zeta,\ell) \sim (\zeta',\ell)$, as desired. □


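Combining Lemmas 17 and 19 we see that $(\rho,\ell) \sim (\rho',\ell)$ iff $V(\rho,\ell) = V(\rho',\ell)$. We
therefore have:


Theorem 20: $(\rho,\ell) \models C_{\mathcal{N}}\varphi$ iff $(\rho',\ell) \models \varphi$ for all $\rho'$ satisfying $V(\rho,\ell) = V(\rho',\ell)$.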


It follows that the identity of $V$ in a precise sense summarizes and uniquely
determines the set of facts that are common knowledge at any given point. The identity of
$V$ can be thought of as being composed of two components: the identity of $G$ and $k$,
and the information about the input that is contained in the joint view $V$. The
definition of the construction implies that the identities of $G$ and $k$ depend only on the
failure pattern, and hence carry only information about the failure pattern. The fact
that $V$ becomes common knowledge implies that certain information about the failure
pattern must become common knowledge. It is difficult, however, to characterize the
facts about the failure pattern that follow from the identity of $G$ and $k$ in terms of
the communication graph $\mathcal{G}(\rho,\ell)$. On the other hand, information about the input
that follows from the views in $V$ does characterize in a crisp way what facts about
the input are common knowledge. Furthermore, it is easy to deduce from $V$ whether
the existence of a failure is common knowledge. As the following corollary will show,
Theorem 20 implies that facts about the input and existence of failures that are
common knowledge at the point $(\rho,\ell)$ must follow directly from the set $V(\rho,\ell)$. We now
make this statement precise. A run $\rho$, a set of processors $G$, and a time $k$ determine
a joint view $V = v(G,\rho,k)$. We denote by "$V$" the property of being a run in which
the processors in $G$ have the joint view $V$ at time $k$ (notice that $G$ and $k$ are uniquely
determined by $V$). Thus, if $V \supset \varphi$ is valid in the system, then every run $\rho'$ satisfying
$v(G,\rho',k) = V$ must also satisfy $\varphi$. We now have:


Corollary 21: Let $\varphi$ be a fact about the input and the existence of failures, and let
$V = V(\rho,\ell)$. Then $(\rho,\ell) \models C_{\mathcal{N}}\varphi$ iff $V \supset \varphi$ is valid in the system.


Proof: Let $V = V(\rho,\ell)$. Suppose $V \supset \varphi$ is valid in the system. By Lemma 17,
we have $V(\rho,\ell) = V(\rho',\ell)$ for all runs $\rho'$ such that $(\rho,\ell) \sim (\rho',\ell)$, and hence that
$(\rho',\ell) \models V$ for all such $\rho'$. Given that $V \supset \varphi$ is valid in the system, we have $(\rho',\ell) \models \varphi$
for all such $\rho'$. It follows that $(\rho,\ell) \models C_{\mathcal{N}}\varphi$.


Conversely, suppose that $V \supset \varphi$ is not valid in the system. Then there is
a run $\eta$ such that $(\eta,\ell) \models V$ and yet $(\eta,\ell) \not\models \varphi$. We will construct
a run $\zeta$ such that $(\rho,\ell) \sim (\zeta,\ell)$, $\zeta$ and $\eta$ have the same input, and $\zeta$ and $\eta$ are the same
with respect to the existence of failures. Since $\varphi$ is a fact about the input and the
existence of failures, $(\eta,\ell) \not\models \varphi$ will imply $(\zeta,\ell) \not\models \varphi$. Since, in addition, $(\rho,\ell) \sim (\zeta,\ell)$,
we will have that $(\rho,\ell) \not\models C_{\mathcal{N}}\varphi$.


We construct $\zeta$ in two steps. We first construct a run $\xi$ with the input of $\eta$ satisfying
$(\rho,\ell) \sim (\xi,\ell)$. Let $\xi$ be the run with the failure pattern of $\rho$ and the input of $\eta$. Given
that $\rho$ and $\xi$ have the same failure pattern, and that $G$ and $k$ depend only on the
failure pattern, we have that $G(\rho,\ell) = G(\xi,\ell)$ and $k(\rho,\ell) = k(\xi,\ell)$. Let us denote
these values by $G$ and $k$. Since $(\eta,\ell) \models V$, we have $v(G,\rho,k) = v(G,\eta,k)$, and hence
$\mathcal{G}_G(\rho,k) = \mathcal{G}_G(\eta,k)$. Since $\xi$ and $\rho$ have the same failure pattern, the unlabeled graphs
underlying $\mathcal{G}_G(\xi,k)$ and $\mathcal{G}_G(\rho,k)$ (and hence also $\mathcal{G}_G(\eta,k)$) are the same. Furthermore,
since $\xi$ and $\eta$ have the same input, it follows that $\mathcal{G}_G(\xi,k)$ and $\mathcal{G}_G(\eta,k)$ (and hence
also $\mathcal{G}_G(\rho,k)$) are equal. Since $\mathcal{G}_G(\rho,k) = \mathcal{G}_G(\xi,k)$ implies $V(\rho,\ell) = V(\xi,\ell)$, we have
$(\rho,\ell) \sim (\xi,\ell)$ by Lemma 19.


We now consider the existence of failures, and construct the desired run $\zeta$. If there
is a failure in $\eta$, then let $\zeta$ be a run differing from $\xi$ only in that a processor fails
after time $\ell$ in $\zeta$. Clearly $(\xi,\ell) \sim (\zeta,\ell)$, and hence $(\rho,\ell) \sim (\zeta,\ell)$. Conversely, if $\eta$ is
failure-free, then let $\zeta = \eta$. Since $\eta$ is failure-free, no processor in $G$ knows of a failure
at time $k$ in $\eta$. Since processors in $G$ have the same view at time $k$ in both $\eta$ and $\rho$,
the same is true of $\rho$. It follows that $B = B(G,\rho,k)$ is empty, and since $G = P - B$,
we have that $G = P$. Notice that $\zeta$ differs from $\xi$ only in that processors in $G = P$ do
not fail in $\zeta$, and hence that $(\xi,\ell) \sim (\zeta,\ell)$ by Lemma 18. Therefore, $(\rho,\ell) \sim (\zeta,\ell)$. In
either case, $(\rho,\ell) \sim (\zeta,\ell)$, $\zeta$ and $\eta$ have the same input, and are the same with respect
to the existence of failures. It follows by the above discussion that $(\rho,\ell) \not\models C_{\mathcal{N}}\varphi$. □


Corollary 21 summarizes the sense in which the construction allows us to test 
whether relevant facts are common knowledge at a given point. Let us consider the 
computational complexity of performing such tests. The first step in applying
Corollary 21 to determine whether a fact is common knowledge at $(\rho,\ell)$ is to construct
$V(\rho,\ell)$. Recall that a group of processors implicitly knows that a processor is faulty
iff it knows of a message the processor failed to send. This is an easy fact to check
given the communication graph corresponding to the group's view. It follows that
computing every iteration of the construction can easily be done in polynomial time.
Furthermore, since the construction is guaranteed to converge within $t + 1$ iterations, it
follows that $G$ and $k$, and hence also $V$, can be computed locally in polynomial time (as
long as $V$ is of polynomial size). Recall that if $\varphi$ is a practical fact, then it is possible to
determine in polynomial time whether or not $V \supset \varphi$ is valid in the system. Thus, given
a practical simultaneous choice problem C, one polynomial-time implementation of a
test for common knowledge of $\mathit{enabled}(a_i)$ is to construct the set $V = V(\rho,\ell)$ and determine
whether $V \supset \mathit{enabled}(a_i)$ is valid in the system. As a result, Theorem 7 implies the
following: 


Theorem 22: If C is an implementable, practical simultaneous choice, then there is 
a polynomial-time optimal protocol for C. 
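

The faultiness check used in each iteration of the construction can be illustrated with a
short Python sketch. It is only an illustration under stated assumptions: the encoding of
a group's view as a list of (sender, receiver, round) triples for messages known to be
undelivered, and the name known_faulty, are hypothetical and do not appear in the paper.
The sketch covers only the (sending) omissions model, where an undelivered message
always implicates its sender.

```python
from typing import Iterable, Set, Tuple

# Hypothetical encoding: (sender, receiver, round) for each message that the
# group's joint view shows was not delivered.
Missing = Tuple[str, str, int]


def known_faulty(missing: Iterable[Missing]) -> Set[str]:
    """Processors the group implicitly knows to be faulty.

    In the sending omissions model only senders can lose messages, so the
    known-faulty processors are exactly the senders of undelivered messages.
    """
    return {sender for sender, _receiver, _round in missing}


view_gaps = [("p3", "p1", 2), ("p3", "p2", 2), ("p5", "p1", 4)]
print(sorted(known_faulty(view_gaps)))
```

The check is a single pass over the undelivered messages, which is consistent with the
claim that each iteration of the construction runs in polynomial time.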


We reiterate the fact that the resulting protocol for C is optimal in all runs: for
any given operating environment, actions are performed in runs of $\mathcal{F}_C$ as soon as they
could possibly be performed by any other protocol. Thus, for example, simultaneous
Byzantine agreement is performed in anywhere between 2 and $t+1$ rounds, depending
on the pattern of failures (as is shown in [DM] to be the case in the crash failure
model). Similarly, the firing squad problem can be performed in anywhere between 1
and $t+1$ rounds after a “start” signal is received. Paradoxically, in all these cases, the
simultaneous actions can be performed quickly only when many failures become known
to the nonfaulty processors. In particular, if there are no failures, no fact about the
input is common knowledge less than $t+1$ rounds after it is first determined to hold.


Notice that every processor, faulty or nonfaulty, is able to compute the set $V(\rho,\ell)$
locally. As a result, the following proposition shows that a fact is common knowledge
to the nonfaulty processors iff it is common knowledge to all processors.


Proposition 23: Let $\varphi$ be an arbitrary fact. In the omissions model, $C_N\varphi \equiv C_P\varphi$ is
valid in all systems running a full-information protocol.


Proof: By Theorem 2, it is enough to show that $(\rho,\ell) \sim_P (\rho',\ell)$ iff $(\rho,\ell) \sim_N (\rho',\ell)$ for
all runs $\rho$ and $\rho'$ and times $\ell$. The ‘if’ direction is trivial, since $N \subseteq P$. The proof of
the other direction is identical to the proof of Lemma 17, interpreting $\sim_N$ as $\sim_P$. □


Proposition 23 implies that all processors (even the faulty processors) know exactly
what actions are commonly known to be enabled in runs of $\mathcal{F}_C$. Thus, in this model the
protocol $\mathcal{F}_C$ is guaranteed to satisfy a stronger version of simultaneous choice problems,
in which condition (ii) is replaced by


(ii’) if $a_i$ is performed by any processor (faulty or nonfaulty), then it is performed by
all processors simultaneously.


Furthermore, since when an action is performed it is performed simultaneously by all
processors, and since no other action is ever performed, there is no need for processors
to continue sending messages after performing actions in runs of $\mathcal{F}_C$ in this model. We
can therefore further optimize the communication of $\mathcal{F}_C$ by having processors halt after
performing a simultaneous action. As a result, the following is an optimal protocol for
any implementable simultaneous choice problem C, and it is simpler than the
protocol $\mathcal{F}_C$:


    repeat every round
        send current view to every processor
    until $C_N\,\mathit{enabled}(a_i)$ holds for some $a_i$;
    j ← min{ i : $C_N\,\mathit{enabled}(a_i)$ holds };
    perform $a_j$;
    halt.
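

The control flow of the protocol above can be sketched in Python as follows. This is a
minimal sketch, not the paper's implementation: the test for common knowledge is
abstracted as a hypothetical oracle ck_enabled(time, i), message sending is reduced to a
comment, and max_rounds is an assumed bound added so that the loop terminates.

```python
from typing import Callable, Optional, Tuple


def run_protocol(
    ck_enabled: Callable[[int, int], bool],  # hypothetical test for C_N enabled(a_i)
    num_actions: int,
    max_rounds: int,
) -> Optional[Tuple[int, int]]:
    """Return (time, index of the action performed), or None if nothing fires."""
    for time in range(max_rounds + 1):
        enabled = [i for i in range(num_actions) if ck_enabled(time, i)]
        if enabled:
            # Perform the enabled action of least index, then halt.
            return time, min(enabled)
        # Otherwise: send the current view to every processor and
        # continue with the next round.
    return None


# Toy oracle: actions 1 and 2 become common knowledge at time 3.
print(run_protocol(lambda time, i: time >= 3 and i in (1, 2), 3, 10))
```

Because every processor evaluates the same test on the same information, choosing the
least index is what makes all processors perform the same action at the same time.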


The fact that in the omissions model the information in $V(\rho,\ell)$ is essentially all that
is common knowledge at a given point has interesting implications about the type of
simultaneous actions that can be performed in this model. For example, recall that in
the traditional simultaneous Byzantine agreement or consensus problems (cf. [PSL], [F],
[DM]), the processors are only required to decide, say, $v$ in case they all start with an
initial value of $v$. It would be more pleasing, however, if they could decide $v$ whenever
the majority of initial values are $v$. This is clearly impossible, since some processors may
be silent throughout the run. However, consider a protocol for simultaneous Byzantine
agreement which is similar to $\mathcal{F}_C$, except that when some $\mathit{enabled}(a_i)$ becomes common
knowledge (which happens exactly when $V$ becomes non-empty), the processors choose
the value that appears in the majority of the initial values recorded in $V(\rho,\ell)$ as their
decision value. In this case, the processors actually approximate majority fairly well:
If more than $(n+t)/2$ of the initial values are $v$, then $v$ will be chosen. In fact, we can
show that the approximation is bad only in runs in which agreement is obtained early.
In particular, if agreement cannot be obtained before time $t+1$ (this would happen
in runs $\rho$ for which $V(\rho,\ell)$ contains only empty views for every $\ell < t$), then the value
agreed upon would be the majority value in case more than $n/2 + 1$ of the processors
have the same initial value. Furthermore, a weak protocol for (exact) majority does
exist: A protocol that either decides that there was a failure or decides on the true
majority value.
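

The $(n+t)/2$ bound can be checked on a small instance with the Python sketch below. It
rests on an assumption that is not spelled out in this passage: that the initial values of
at most $t$ processors are missing from the recorded view. The helper names decide and
recorded_views are hypothetical.

```python
from collections import Counter
from itertools import combinations


def decide(recorded):
    """Return the strict-majority value among the recorded initial values, if any."""
    value, votes = Counter(recorded).most_common(1)[0]
    return value if 2 * votes > len(recorded) else None


def recorded_views(initial, t):
    """Yield every list of recorded values with at most t initial values missing."""
    n = len(initial)
    for hidden in range(t + 1):
        for gone in combinations(range(n), hidden):
            yield [v for i, v in enumerate(initial) if i not in gone]


n, t = 7, 2
above = [1] * 5 + [0] * 2  # 5 > (n + t) / 2 = 4.5
below = [1] * 4 + [0] * 3  # 4 < 4.5, so the bound gives no guarantee

print(all(decide(r) == 1 for r in recorded_views(above, t)))
print(all(decide(r) == 1 for r in recorded_views(below, t)))
```

With more than $(n+t)/2$ copies of $v$, at least $(n-t)/2$ of them survive the loss of $t$ values
while fewer than $(n-t)/2$ other values exist, so $v$ remains a strict majority of what is
recorded; the second instance shows the guarantee can fail just below the bound.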


Since messages from faulty processors can convey new information about the failure 
pattern, such messages do affect the construction. Therefore, the behavior of faulty 
processors, even after they have been discovered to be faulty, plays an important role 
in determining what facts become common knowledge and when. In the crash failure 
model, however, a failed processor does not communicate with other processors after 
its failing round and has little impact on what facts become common knowledge. This 
is an essential property of the omissions model operationally distinguishing it from the 
crash failure model. 


We note, however, that all of the analysis in this subsection applies to the crash 
failure model, with all of the proofs applying verbatim when restricted to the crash 
failure model. We thus have: 


Proposition 24: In the crash failure model, $(\rho,\ell) \models C_N\varphi$ iff it is the case that
$(\rho',\ell) \models \varphi$ for all $\rho'$ satisfying $V(\rho,\ell) = V(\rho',\ell)$.


Thus, the set $V(\rho,\ell)$ completely characterizes what facts are common knowledge at
the point $(\rho,\ell)$ in the crash failure model as well. Since the same proofs show that the
construction characterizes the connected components of the similarity graph in both
the omissions and the crash failure model, the similarity graph in the omissions model is
simply an extension of the similarity graph in the crash failure model, maintaining the
same connected components. This implies that in a run of the omissions model having
a failure pattern consistent with the crash failure model, exactly the same facts about
the input and the existence of failures are common knowledge at any given time in
both the crash failure and the omissions model. (However, as a result of the difference
in the types of failures possible in the two failure models, different facts about the
failure pattern are common knowledge at the corresponding points.) Ruben Michel has
independently characterized the similarity graph in variants of the crash failure model
(cf. [Mi]). For the crash failure model itself, he has an alternative construction that
also characterizes the connected components of the similarity graph.


As in the omissions model, it follows from Proposition 24 that our construction can 
be used to derive efficient optimal protocols for simultaneous choice problems in the 
crash failure model, thus slightly extending [DM]. We therefore have the following: 


Corollary 25: Let C be an implementable, practical simultaneous choice. In the 
crash failure model, there is a polynomial-time optimal protocol for C. 


As a final remark, let $k_i$ and $G_i$ be the intermediate results of beginning the
construction at the point $(\rho,\ell)$, and denote $v(G_i,\rho,k_i)$ by $V_i$. Consider the operator $\mathcal{E}$
defined by $\mathcal{E}(V_i) = V_{i+1}$ for all $i$. We find it interesting that $V$, which is the greatest
fixpoint of the operator $\mathcal{E}$, characterizes the facts $\varphi$ for which $C_N\varphi$ holds, where we
know from [HM] that $C_N\varphi$ is the greatest fixpoint of $X \equiv \varphi \wedge E_N X$.


6.2 Receiving Omissions 


In the omissions model, faulty processors fail only to send messages. In this subsection,
we consider the symmetric receiving omissions model, in which faulty processors fail
only to receive messages. While at first glance these models seem very similar, they
are actually extremely different. In particular, we will see that testing for common
knowledge in this model becomes trivial. As a result, there are simple, efficient optimal
protocols for practical simultaneous choice problems in this model.


One intriguing difference between the omissions model and the receiving omissions
model is the following. We have seen in the omissions model that in some cases a fact
(for example, the arrival of a “start” signal) does not become common knowledge until
as many as $t+1$ rounds after it is first determined to hold. Intuitively, the attainment
of common knowledge is delayed by the possibility that a processor might fail to send a
message determining that the fact holds. However, in the receiving omissions model even
faulty processors send all messages required by the protocol. Since nonfaulty processors
receive all messages sent to them, in runs of a full-information protocol all nonfaulty
processors have a complete view of the first $k$ rounds at time $k+1$. We can thus show
the following:


Theorem 26: Let $\varphi$ be a fact about the first $k$ rounds. In the receiving omissions
model, $(\rho,k) \models \varphi$ iff $(\rho,k+1) \models C_N\varphi$.


The proof of this result depends on the notion of a fact being valid at time $k$: a fact $\varphi$
is said to be valid (in the system) at time $k$ if for all runs $\rho$ we have $(\rho,k) \models \varphi$. We
remark that the following variant of the induction rule holds:


    If $\varphi \supset E_N\varphi$ is valid at time $k$,
    then $\varphi \supset C_N\varphi$ is valid at time $k$.


Proof: Since $\varphi$ is a fact about the first $k$ rounds, $(\rho,k) \models \varphi$ iff $(\rho,k+1) \models \varphi$. Thus,
it is enough to show that $(\rho,k+1) \models \varphi$ iff $(\rho,k+1) \models C_N\varphi$. Clearly, $(\rho,k+1) \models C_N\varphi$
implies $(\rho,k+1) \models \varphi$. Conversely, suppose $(\rho,k+1) \models \varphi$. During round $k+1$ in $\rho$
every processor sends its entire view to all processors, so at time $k+1$ all nonfaulty
processors have a complete view of the first $k$ rounds of $\rho$. Since $\varphi$ is a fact about
the first $k$ rounds, $(\rho,k+1) \models E_N\varphi$. We have just shown that $\varphi \supset E_N\varphi$ is valid at
time $k+1$, so $\varphi \supset C_N\varphi$ is valid at time $k+1$ as well. Thus, $(\rho,k+1) \models \varphi$ implies
$(\rho,k+1) \models C_N\varphi$. □


As a consequence of Theorem 26, efficient optimal protocols for practical simulta- 
neous choice problems are very simple in this model. 


Corollary 27: Let C be an implementable, practical simultaneous choice. In the 
receiving omissions model, there is a polynomial-time optimal protocol for C. 


Proof: We claim that $(\rho,\ell) \models C_N\,\mathit{enabled}(a_i)$ iff $\mathcal{G}(\rho,\ell-1) \supset \mathit{enabled}(a_i)$ is valid in
the system. Since C is a practical simultaneous choice problem, determining whether
$\mathcal{G}(\rho,\ell-1) \supset \mathit{enabled}(a_i)$ is valid in the system can be done in polynomial time. Since
all nonfaulty processors know $\mathcal{G}(\rho,\ell-1)$ at $(\rho,\ell)$, this will yield a polynomial-time
implementation of a test for common knowledge of $\mathit{enabled}(a_i)$. Thus, Theorem 7 will
imply that $\mathcal{F}_C$ is a polynomial-time optimal protocol for C. Now, suppose
$\mathcal{G}(\rho,\ell-1) \supset \mathit{enabled}(a_i)$ is valid in the system. Theorem 26 implies that $\mathcal{G}(\rho,\ell-1)$ is common
knowledge at $(\rho,\ell)$, and it follows that $(\rho,\ell) \models C_N\,\mathit{enabled}(a_i)$. Conversely, suppose
$(\rho,\ell) \models C_N\,\mathit{enabled}(a_i)$. Let $\zeta$ be a run satisfying $\mathcal{G}(\rho,\ell-1)$. A proof similar to the
base case of Lemma 11 shows that $(\rho,\ell) \sim (\zeta,\ell)$. Since $(\rho,\ell) \models C_N\,\mathit{enabled}(a_i)$, it
follows that $(\zeta,\ell) \models \mathit{enabled}(a_i)$. Thus, $\mathcal{G}(\rho,\ell-1) \supset \mathit{enabled}(a_i)$ is valid in the system,
as desired. □


The results of this section point out a number of interesting differences between
the omissions model and the receiving omissions model. For example, consider the
distributed firing squad problem. First, Theorem 26 implies that all nonfaulty processors
are able to fire in the receiving omissions model exactly one round after the first “start”
signal is received. Recall that in the omissions model, firing may be delayed as many as
$t+1$ rounds. Second, since a faulty processor $p$ might fail to receive all messages, it
is not possible to guarantee that $p$ will ever fire when a “start” signal is received by
a nonfaulty processor. In the omissions model we have shown that it is possible to
guarantee that all processors perform any action performed by the nonfaulty processors.
Finally, notice that faulty processors may sometimes be unable to halt, even after
the nonfaulty processors have fired: A processor $p$ receiving no messages or “start”
signals can never halt since at every point it is possible it will be the only processor
in the system to receive a “start” signal. In this case, optimal protocols must require
the nonfaulty processors to fire one round later, and hence $p$ must be able to send
this information to the nonfaulty processors. In contrast, in the omissions model it is
possible to guarantee that all processors halt as soon as an action is performed in the
system. These remarks show that while at first glance the assignment of responsibility
for undelivered messages to sending or to receiving processors may seem arbitrary,
the assignment has a dramatic effect on when facts become common knowledge, and
hence on the behavior of optimal protocols. Since such a simple modification of the
omissions model results in the collapse of the combinatorial structure underlying the
model (witness Theorem 26), we consider this to be an indication that the omissions
model is not a robust model of failure.


6.3 Generalized Omissions 


We have just seen that whether sending or receiving processors are responsible for
undelivered messages has a dramatic effect on the structure of the omissions model.
Perhaps a more natural model of failure is the generalized omissions model, in which
a faulty processor may fail both to send and to receive messages. This section is
concerned with the design of optimal protocols for simultaneous choice problems in this
model. We have seen that Theorem 7 implies the protocol $\mathcal{F}_C$ is an optimal protocol
in this model, and that Theorem 10 implies this protocol can be implemented in
polynomial space. As in previous sections, the remaining question is whether there
are efficient optimal protocols in this model. The fundamental result of this section
is that testing for common knowledge in the generalized omissions model is NP-hard.
Using the close relationship between common knowledge and simultaneous actions, we
obtain as a corollary that optimal protocols for almost any simultaneous choice problem
in this model require processors to perform NP-hard computations. Consequently, for
example, in this model there can be no efficient optimal protocol for simultaneous
Byzantine agreement or the distributed firing squad problem. This is a dramatic difference
between the generalized omissions model and the more benign failure models, where,
as we have seen, efficient optimal protocols do exist.


One important difference between the generalized omissions model and simpler
variants of the omissions model is that in the generalized omissions model undelivered
messages do not necessarily identify the set of faulty processors, but merely place
constraints on their possible identities: Either the sender or the intended receiver of every
undelivered message must be faulty. The faulty processors must therefore induce a
“vertex cover” of the undelivered messages. Recall that in our analysis of the omissions
failure model, determining the number and the identity of the faulty processors
given the labeled communication graph of a point played a crucial role in characterizing
the facts that are common knowledge at a point. In that model, a processor is known
to be faulty iff it is known that a message it was supposed to send was not delivered,
a fact easily determined from the labeled communication graph. In the generalized
omissions model, however, even determining the number (and not necessarily the
identities) of processors implicitly known to be faulty essentially involves computing the
size of a minimum vertex cover of a graph, a problem known to be NP-complete (cf.
[GJ]). It is with this intuition that we now proceed to show that determining whether
certain facts are common knowledge is computationally prohibitive in the generalized
omissions model, assuming P ≠ NP.
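

The vertex-cover constraint described above can be made concrete with a small Python
sketch. It is only an illustration: the encoding of undelivered messages as
(sender, receiver) pairs and the names covers and min_faulty are hypothetical, and the
search is a deliberately naive brute force whose running time is exponential in the
number of processors.

```python
from itertools import combinations


def covers(candidate, undelivered):
    """True if every undelivered message has its sender or receiver in candidate."""
    return all(s in candidate or r in candidate for s, r in undelivered)


def min_faulty(processors, undelivered):
    """Smallest set of processors whose failure explains all undelivered messages."""
    for size in range(len(processors) + 1):
        for candidate in combinations(processors, size):
            if covers(set(candidate), undelivered):
                return size, set(candidate)


processors = ["a", "b", "c", "d", "e"]
undelivered = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
size, witness = min_faulty(processors, undelivered)
print(size, sorted(witness))
```

In the sending omissions model the same information would identify the faulty senders
directly; here several different sets of the minimum size may explain the same
undelivered messages, which is the source of the difficulty discussed in this section.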


However, in order to study the complexity of testing for common knowledge in the
generalized omissions model in a meaningful way, we are once again faced with the need
to restrict our attention to a class of facts that includes all of the facts that may arise
in natural simultaneous choice problems, and excludes anomalous cases. For example,
if $\varphi$ is valid in the system, then so is $C_N\varphi$, and testing whether $\varphi$ is common knowledge
is a trivial task. On the other hand, one can imagine facts involving excessive
computational complexity of a type irrelevant to simultaneous choice problems. Consider,
for instance, a fact $\varphi$ with the property that the communication graph of any point
satisfying $\varphi$ encodes information allowing the solution of all problems in NP of size
smaller than the number of processors in the system. Whereas it seems unlikely that
such a fact exists, proving that none exists is probably very hard, and it is definitely not
the business of this paper to do so. We are therefore led to make the following restriction.
A fact $\varphi$ is said to be admissible within a class of systems running a full-information
protocol if (i) for all systems within this class neither $\varphi$ nor $\neg\varphi$ is valid in the system,
and (ii) there is a polynomial-time algorithm explicitly constructing for each system
a labeled communication graph $\mathcal{G}(\rho,\ell)$ of minimal length having the property that
$\mathcal{G}(\rho,\ell) \supset \varphi$ is valid in the system. We say that a simultaneous choice problem C is
admissible if each condition $\mathit{enabled}(a_i)$ is admissible within the class of systems
determined by a full-information protocol and C. We claim that any natural simultaneous
choice is admissible. We can now state the fundamental result of this section which
says, loosely speaking, that testing for common knowledge of admissible facts $\varphi_1,\ldots,\varphi_s$
is NP-hard.


Lemma 28: Let $\varphi_1,\ldots,\varphi_s$ be admissible practical facts within a class of systems
running a full-information protocol in the generalized omissions model. Given the graph
$\mathcal{G}(\rho,\ell)$ of a point in such a system with $n > 2t$, the problem of determining whether
$(\rho,\ell) \models \bigvee_i C_N\varphi_i$ is NP-hard.


The proof of Lemma 28 will follow shortly. Notice, however, that $t$ is variable in the
statement of this lemma, and in general may be $O(n)$. The proof of this result will not
apply for a fixed $t$, nor to cases in which $t$ is restricted, say, to be $O(\log n)$. In any case, it
will follow that any standard implementation of our optimal knowledge-based protocols
must be computationally intractable, unless P=NP. It is natural to ask whether this
inefficiency is merely the result of having programmed our protocols using tests for
common knowledge. It is conceivable, for instance, that there are optimal protocols
for admissible simultaneous choice problems in the generalized omissions model that
are computationally efficient. Intuitively, however, in order to perform a simultaneous
action, an optimal protocol P must essentially determine whether any of the conditions
$\mathit{enabled}(a_i)$ is common knowledge. Corollary 6 implies that such a condition becomes
common knowledge during the corresponding run of a full-information protocol as soon
as it does during a run of P. Thus, P must essentially determine whether such a fact
is common knowledge during the corresponding run of a full-information protocol $\mathcal{F}$.
Since Lemma 28 implies that this problem is NP-hard, computing the function P must
be NP-hard as well. We now make this argument precise.


Recall that a protocol is formally a function mapping $n$, $t$, and a processor’s view
to a list of the actions the processor should perform, followed by a list of the messages
it is required to send in the following round. We say that a protocol is
communication-efficient if in a system of $n$ processors the size of the messages each processor is required
to send during round $\ell$ is polynomial in $n$ and $\ell$. In the following result we show that the
problem of computing the function corresponding to a communication-efficient optimal
protocol is NP-hard. Hence, no such protocol can be computationally efficient.


Theorem 29: Let P be a communication-efficient, optimal protocol for an admissible, 
practical simultaneous choice C. The problem of computing (the function) P is NP- 
hard. 


Proof: Let $\mathcal{S} = \{\Sigma(n,t) : n \ge t+2\}$ be the class of systems determined by C and
a full-information protocol. We will exhibit a Turing reduction from the problem of
Lemma 28 to the problem of computing P; that is, given the graph $\mathcal{G}(\rho,\ell)$ of a point
$(\rho,\ell)$ of a system $\Sigma(n,t)$ where $n > 2t$, we will show how to use P to determine in
polynomial time whether $(\rho,\ell) \models \bigvee_i C_N\,\mathit{enabled}(a_i)$. Since C is an admissible, practical
simultaneous choice, each condition $\mathit{enabled}(a_i)$ must be an admissible, practical fact
within $\mathcal{S}$. It follows by Lemma 28 that determining whether $(\rho,\ell) \models \bigvee_i C_N\,\mathit{enabled}(a_i)$
is NP-hard. Thus, having exhibited the proposed Turing reduction, we will have shown
that the problem of computing P is NP-hard.


Notice that C must be implementable since P is a protocol for C. Thus, Theorem 7
implies that $\mathcal{F}_C$ is an optimal protocol for C. Let $\rho$ and $\zeta$ be corresponding runs of $\mathcal{F}_C$
and P, respectively. It follows from the definition of $\mathcal{F}_C$ that $(\rho,\ell) \models \bigvee_i C_N\,\mathit{enabled}(a_i)$
iff the nonfaulty processors perform a simultaneous action no later than time $\ell$ in $\rho$.
Since $\mathcal{F}_C$ and P are both optimal protocols for C, the nonfaulty processors perform
simultaneous actions at the same times during $\rho$ and $\zeta$. Since $n > 2t$, there must be at
least $t+1$ nonfaulty processors in both runs, so the nonfaulty processors simultaneously
perform an action no later than time $\ell$ in either run iff $t+1$ processors do so. Therefore,


$(\rho,\ell) \models \bigvee_i C_N\,\mathit{enabled}(a_i)$ iff $t+1$ processors perform a simultaneous action no later
than time $\ell$ in $\zeta$.


One algorithm for determining whether $t+1$ processors do perform a simultaneous
action no later than time $\ell$ in $\zeta$ is to construct the view of each processor in $\zeta$ at each
time $k$ before time $\ell$, and use P to determine when processors are required to perform
actions. Suppose we have constructed the view of each processor at time $k-1$ in $\zeta$; let us
consider the problem of constructing the view of a processor $p$ at time $k$. Processor $p$’s
view at $(\zeta,k)$ consists of $p$’s name, the time $k$, a list of the messages received by $p$ during
the first $k$ rounds of $\zeta$, and a list of the input received by $p$ during the first $k$ rounds of $\zeta$.
Recall that since $\rho$ is a run of a full-information protocol, the graph $\mathcal{G}(\rho,\ell)$ is actually an
encoding of the operating environment during the first $\ell$ rounds of $\rho$, and hence also
of $\zeta$. Given the views of all processors at time $k-1$, the protocol P determines what
messages each processor is required to send to $p$, and $\mathcal{G}(\rho,\ell)$ determines which of these
messages are actually delivered to $p$. Since P is communication-efficient, each of these
messages is of size polynomial in $n$ and $k$. Furthermore, the input received by $p$ during
round $k$ labels the node $(p,k)$ of $\mathcal{G}(\rho,\ell)$. Since C is practical, this input is of constant
size. Thus, given each processor’s view at time $k-1$, we can use the graph $\mathcal{G}(\rho,\ell)$ and
an oracle for P to construct the view of each processor at time $k$ in polynomial time.
(An oracle for P is an oracle that, given the view of a processor $p$ at a point $(\rho,\ell)$, in
one step determines what actions P requires $p$ to perform at time $\ell$, and constructs the
messages P requires $p$ to send during round $\ell+1$.)


Consider the following algorithm: 


    action_performed ← false;
    k ← 0;
    repeat
        for all processors p do
            determine whether P requires p to perform any action at time k, and
            construct the messages P requires p to send during round k + 1;
        endfor
        if t + 1 processors perform actions at time k
            then action_performed ← true;
        k ← k + 1;
    until k > $\ell$ or action_performed;
    if action_performed
        then halt with “yes”
        else halt with “no”.


From the previous discussion it is clear that given any oracle for P, this algorithm
determines in polynomial time whether $t+1$ processors perform actions simultaneously
no later than time $\ell$ in $\zeta$, and hence whether $(\rho,\ell) \models \bigvee_i C_N\,\mathit{enabled}(a_i)$. □
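

The oracle algorithm used in this proof can be sketched in Python as follows. Everything
about the encoding is an assumption made for illustration: a view is a tuple
(name, time, received messages, inputs), the oracle is a function returning a pair
(acts, outgoing messages), and the communication graph is represented by two
hypothetical callbacks, delivered(k, q, p) and inputs(k, p).

```python
def acts_by(oracle, n, t, ell, delivered, inputs):
    """True iff at least t + 1 processors act at some time k <= ell.

    oracle(view) -> (acts, outgoing), where outgoing maps receiver -> message.
    delivered(k, q, p) -> True iff q's round-k message to p is delivered.
    inputs(k, p) -> external input received by p at time k.
    """
    views = {p: (p, 0, (), (inputs(0, p),)) for p in range(n)}
    for k in range(ell + 1):
        answers = {p: oracle(views[p]) for p in range(n)}
        if sum(acts for acts, _ in answers.values()) >= t + 1:
            return True
        # Build every view at time k + 1 from the oracle's messages and
        # the delivery pattern recorded in the communication graph.
        nxt = {}
        for p in range(n):
            _, _, received, seen = views[p]
            arrived = tuple((q, answers[q][1].get(p)) for q in range(n)
                            if q != p and delivered(k + 1, q, p))
            nxt[p] = (p, k + 1, received + (arrived,), seen + (inputs(k + 1, p),))
        views = nxt
    return False


def toy_oracle(view):
    """Toy protocol: every processor acts from time 2 on and sends a short tag."""
    name, time, _received, _seen = view
    return time >= 2, {q: (name, time) for q in range(4)}


print(acts_by(toy_oracle, 4, 1, 2, lambda k, q, p: True, lambda k, p: None))
```

The loop makes one oracle call per processor per round, so the number of oracle calls
is polynomial in $n$ and $\ell$, in line with the argument above.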


As an immediate corollary of Theorem 29, we have the following: 


Corollary 30: Let C be an admissible practical simultaneous choice problem. If there 
is a polynomial-time optimal protocol for C, then P=NP. 


Corollary 30 implies that optimal protocols for simultaneous choice problems as 
simple as the distributed firing squad problem or simultaneous Byzantine agreement 
are computationally infeasible in the generalized omissions model, assuming P ≠ NP. In
fact, we do not know whether these problems can be implemented in polynomial time 
even using an NP oracle. The best we can do in the generalized omissions model is im- 
plement them using polynomial-space computations, as in the proof of Theorem 10. We 
consider the question of determining the exact complexity of implementing admissible 
practical simultaneous choice problems in this model an interesting open problem. 


We now proceed to prove Lemma 28. First, however, we state a result that will 
be very useful in the proof of Lemma 28. Roughly speaking, it says that if a group of 
processors can (jointly) prove that they are nonfaulty, then their views become common 
knowledge at the end of the following round. 


Lemma 31: Let $S$ be a set of processors and let $\bar{S} = P - S$. Let $\rho$ be a run of a
full-information protocol. If the processors in $S$ implicitly know at $(\rho,\ell-1)$ that $\bar{S}$
contains $t$ faulty processors, then the joint view of $S$ at $(\rho,\ell-1)$ is common knowledge
at $(\rho,\ell)$.


Proof: Let $\varphi$ = “$V$ is the joint view of $S$ at time $\ell-1$”, where $V = v(S,\rho,\ell-1)$.
Suppose $(\rho',\ell) \models \varphi$. Given that $S$ has the same (joint) view at $(\rho,\ell-1)$ and at
$(\rho',\ell-1)$, and since $S$ implicitly knows at $(\rho,\ell-1)$ that $\bar{S}$ contains $t$ faulty processors,
$S$ implicitly knows the same at $(\rho',\ell-1)$. In particular, the processors in $S$ must
be nonfaulty in $\rho'$ and each must successfully send its view to all processors during
round $\ell$ of $\rho'$, and since all nonfaulty processors will receive these messages, we have
$(\rho',\ell) \models E_N\varphi$. It follows that $\varphi \supset E_N\varphi$ is valid at time $\ell$, and the induction rule
implies $\varphi \supset C_N\varphi$ is valid at time $\ell$ as well. Thus, $(\rho,\ell) \models \varphi$ implies $(\rho,\ell) \models C_N\varphi$. □


(We note in passing that a converse to Lemma 31 is also true: If the joint view at
time $\ell-1$ of a set $S$ of processors is common knowledge at time $\ell$, then the processors
in some set $S' \supseteq S$ must implicitly know at time $\ell-1$ that there are $t$ faulty processors
among the members of $\bar{S}'$.)


In addition to Lemma 31, the following result, analogous to Lemma 11 in the 
omissions model, will be of use in the proof of Lemma 28. 
Lemma 32: Let $\rho$ and $\rho'$ be runs differing only in the (faulty) behavior displayed by
processor $p$ after time $k$, and suppose no more than $f$ processors fail in either $\rho$ or $\rho'$.
If $\ell - k \le t + 1 - f$, then $(\rho,\ell) \sim (\rho',\ell)$.


Proof: The proof is analogous to the proof of Lemma 11, with the added observation
that if $p$ sends no messages after (an arbitrary) time $k$ in $\zeta$, then $(\zeta,\ell) \sim (\zeta',\ell)$ where $\zeta'$
differs from $\zeta$ in that $p$ receives messages from an arbitrary set of processors during
round $k$. □


As we have already mentioned, the proof of Lemma 28 involves a reduction from
the Vertex Cover problem (cf. [GJ]). This is the problem of determining, given a graph
$G = (V,E)$ and a positive integer $k$, whether $G$ has a vertex cover of size $k$ or less; that
is, a subset $V' \subseteq V$ such that $|V'| \le k$ and, for each edge $\{v,w\} \in E$, at least one of $v$
and $w$ belongs to $V'$.


Theorem [Karp]: Vertex Cover is NP-complete. 


We now prove Lemma 28. 


Proof of Lemma 28: We will exhibit a Turing reduction from Vertex Cover to
the problem of testing for common knowledge of $\varphi_1,\ldots,\varphi_s$, and it will follow that this
problem is NP-hard. Since every graph $G = (V,E)$ is $|V|$-coverable, the following is an
algorithm for Vertex Cover:


    m ← |V|;
    while G has a vertex cover of size m − 1 do
        m ← m − 1;
    if m ≤ k
        then return “G has a vertex cover of size k”
        else return “G has no vertex cover of size k”.
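

The loop above can be sketched in Python with the coverability test passed in as a
parameter, which mirrors the role of the oracle in the reduction. The names
has_cover_of_size and brute_force_coverable are hypothetical, and the brute-force test
is only a stand-in for the oracle: in the proof, that test is answered by a query about
common knowledge rather than by exhaustive search.

```python
from itertools import combinations


def brute_force_coverable(vertices, edges, size):
    """Stand-in oracle: does the graph have a vertex cover of at most `size` vertices?"""
    if size < 0:
        return False
    size = min(size, len(vertices))
    # A smaller cover can always be padded, so checking sets of exactly
    # `size` vertices is enough.
    return any(all(v in c or w in c for v, w in edges)
               for c in map(set, combinations(vertices, size)))


def has_cover_of_size(vertices, edges, k, coverable=brute_force_coverable):
    """Decide Vertex Cover by lowering m while the graph stays (m - 1)-coverable."""
    m = len(vertices)
    while coverable(vertices, edges, m - 1):
        m -= 1
    return m <= k


triangle = (["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")])
print(has_cover_of_size(*triangle, 1), has_cover_of_size(*triangle, 2))
```

The loop makes at most $|V| + 1$ calls to the coverability test, so a polynomial-time
test would yield a polynomial-time algorithm for Vertex Cover.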


To implement this test, it is enough to implement a test that, given an $m$-coverable
graph $G$, determines whether $G$ is not $(m-1)$-coverable. Every graph $G = (V,E)$
clearly has a vertex cover of size $|V| - 1$. In addition, it is possible to determine
whether $G$ has a vertex cover of size $|V| - 2$ in polynomial time. Similarly, it is easy
to determine whether $G$ has a vertex cover of size 0 in polynomial time. We show that
if $1 < m \le |V| - 2$ and $G$ is $m$-coverable, then it is possible to construct in polynomial
time a graph $\mathcal{G}(\rho,\ell)$ such that $(\rho,\ell) \models \bigvee_i C_N\varphi_i$ iff $G$ is not $(m-1)$-coverable. The point
$(\rho,\ell)$ will be a point of a system $\Sigma(n,t)$ with $n > 2t$ from the class under consideration.
Thus, given an oracle for testing for common knowledge of $\varphi_1,\ldots,\varphi_s$, we will have a
polynomial-time test for the $(m-1)$-coverability of $G$. It will follow that testing for
common knowledge of $\varphi_1,\ldots,\varphi_s$ is NP-hard.


Figure 5: Embedding a graph $G$ in a run $\rho$.


Fix a graph $G = (V,E)$ and an integer $m$ satisfying $1 < m \le |V| - 2$. Let
$n = |V| + m + 3$ and $t = m + 2$, and let $\Sigma(n,t)$ be a system from the class under consideration.
Notice that since $|V| \ge m + 2$, we have $n > 2t$. Since each fact $\varphi_i$ is admissible, we
can explicitly construct in polynomial time a labeled communication graph (of a point
in $\Sigma(n,t)$) of minimal length determining $\varphi_i$. Of these graphs, let $\mathcal{G}$ be one of minimal
length, say of length $k$. Let $\rho$ be a run of $\Sigma(n,t)$, illustrated in Figure 5, satisfying
the following conditions: (i) the input received in the first $k$ rounds of $\rho$ is the same
as in $\mathcal{G}$, and no input is received after time $k$; (ii) all messages in the first $k$ rounds
are delivered; (iii) in round $k+1$, the only undelivered messages are as follows: no
message is delivered from processor $p_v$ to $p_w$ in round $k+1$ of $\rho$ iff there is an edge
from $v$ to $w$ in $G$ (that is, the graph $G$ is represented by the undelivered messages
during round $k+1$); (iv) two additional processors $f_1$ and $f_2$ are silent from time $k+1$
in $\rho$, and all other messages after time $k+1$ are delivered; and (v) a set $S$ of $t+1$
additional processors do not fail in $\rho$. Since $G$ has a vertex cover $V'$ of size $m$, one failure
pattern consistent with the undelivered messages in $\rho$ is that $p_v$ is faulty for every $v \in V'$
(accounting for the undelivered messages during round $k+1$ of $\rho$) and that both $f_1$
and $f_2$ are faulty. Given that $t = m + 2$ processors fail in this failure pattern, there
is a run $\rho$ of $\Sigma(n,t)$ satisfying the required conditions. Since the graph $\mathcal{G}$ determining
the input of $\mathcal{G}(\rho,k)$ can be constructed in polynomial time, setting $\ell = k + 3$, the
graph $\mathcal{G}(\rho,\ell)$ can be constructed in polynomial time as well. It remains to show that
$(\rho,\ell) \models \bigvee_i C_N\varphi_i$ iff $G$ is not $(m-1)$-coverable.


Suppose $G$ has no vertex cover of size $m - 1$, and let $F$ be the set of processors
failing in $\rho$. Since $f_1$ and $f_2$ must be faulty (each fails to the $t+1$ processors
in $S$), the set $F' = F - \{f_1, f_2\}$ must account for every undelivered message during
round $k+1$. If there is an edge from $v$ to $w$ in $G$, then no message from $p_v$ to $p_w$
is delivered in round $k+1$, and one of $p_v$ or $p_w$ must be in $F'$. It follows that $F'$
must induce a vertex cover of $G$. Since $G$ has no vertex cover of size $m - 1$, $F'$ must
contain at least $m$ processors, and $F$ at least $t = m + 2$. Thus, the processors in $S$
implicitly know at time $k+2$ that their complement $\bar{S} = P - S$ contains $t$ faulty
processors. By Lemma 31, their views at time $k+2$ must be common knowledge at time $k+3$.
These views contain a complete description of $\mathcal{G}(\rho,k)$, and hence the identity
of $\mathcal{G}(\rho,k)$ is common knowledge at $(\rho,\ell)$. Recall that $\mathcal{G}$ was
chosen to be a graph determining $\varphi_i$ for some $i$. If $\mathcal{G}$ does not specify
a failure, then $\mathcal{G}(\rho,k) = \mathcal{G}$, and it follows that
$(\rho,\ell) \models C_{\mathcal{N}} \varphi_i$. On the other hand, if $\mathcal{G}$ does
specify a failure, then $\varphi_i$ is determined by the input to the first $k$ rounds of
$\mathcal{G}$ and the existence of a failure. Notice that the failure of $f_1$ and $f_2$ is
also recorded in the view of $S$ at time $k+2$, and hence is also common knowledge at
$(\rho,\ell)$. Thus, the existence of a failure is common knowledge at time $\ell$, and it
follows that $(\rho,\ell) \models C_{\mathcal{N}} \varphi_i$. In either case, we have
$(\rho,\ell) \models \bigvee_i C_{\mathcal{N}} \varphi_i$.
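
The step from "no vertex cover of size m - 1" to "at least t faulty processors" can be
checked by brute force on small graphs. The sketch below is only an illustration: it uses
exhaustive search (exponential in |V|), and the helper names are hypothetical.

```python
from itertools import combinations


def has_vertex_cover(vertices, edges, size):
    """True iff some set of at most `size` vertices touches every edge."""
    limit = min(max(size, 0), len(vertices))
    return any(
        all(v in cover or w in cover for v, w in edges)
        for r in range(limit + 1)
        for cover in map(set, combinations(vertices, r))
    )


def min_faults_forced(vertices, edges):
    """Fewest faulty processors consistent with the run: a minimum cover plus f_1 and f_2."""
    smallest = next(c for c in range(len(vertices) + 1)
                    if has_vertex_cover(vertices, edges, c))
    return smallest + 2


def s_learns_t_faults(vertices, edges, m):
    """S can infer t = m + 2 faults exactly when at least m + 2 faults are forced."""
    return min_faults_forced(vertices, edges) >= m + 2


path = (["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
star = (["c", "x", "y", "z"], [("c", "x"), ("c", "y"), ("c", "z")])
print(s_learns_t_faults(*path, m=2))  # True: no single vertex covers the path
print(s_learns_t_faults(*star, m=2))  # False: the center alone covers the star
```

The exhaustive search makes the source of the hardness visible: deciding what S can infer
amounts to deciding whether a smaller vertex cover exists.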


Conversely, suppose $G$ does have a vertex cover of size $m - 1$. Without loss of
generality, at most $t - 1$ processors fail in $\rho$. First, we claim that
$(\rho,\ell) \sim (\zeta,\ell)$ where $\zeta$ is a failure-free run with the input of
$\rho$. Since $f_1$ and $f_2$ fail only after time $k + 1 = \ell - 2$, two applications of
Lemma 32 imply that $(\rho,\ell) \sim (\rho',\ell)$ where $\rho'$ differs from $\rho$ in
that $f_1$ and $f_2$ do not fail in $\rho'$. Since at most $t - 3$ processors fail in
$\rho'$ and $k = \ell - 3$, by Lemma 32 we have $(\rho',\ell) \sim (\zeta,\ell)$. Second, we
claim that for each $\varphi_i$ there is a run $\eta_i$ not satisfying $\varphi_i$ that
differs from $\mathcal{G}$ only after time $k - 1$. If $k = 0$, then since $\varphi_i$ is
admissible and hence not valid in the system, such a run must certainly exist. On the other
hand, if $k > 0$, then since $\mathcal{G}$ was chosen to be a labeled communication graph of
minimal length determining $\varphi_j$ for some $\varphi_j$, such a run must exist in this
case as well. Now, let $\eta'_i$ be a run having the input of $\eta_i$, in which no
processor fails before time $\ell$, and in which processors become silent after time $\ell$
iff there is a failure in $\eta_i$. Since $\varphi_i$ is a fact about the input and
existence of failures, and since $\eta_i$ does not satisfy $\varphi_i$, neither does
$\eta'_i$. Let $\hat{\zeta}$ and $\hat{\eta}'_i$ be runs of $\mathcal{F}$ in the omissions
model having the operating environment of $\zeta$ and $\eta'_i$, respectively. (Notice that
these operating environments are actually operating environments of the omissions model.)
Notice that no processor fails before time $\ell$ in either $\hat{\zeta}$ or
$\hat{\eta}'_i$. It follows that $\hat{G}(\hat{\zeta},\ell) = \hat{G}(\hat{\eta}'_i,\ell)$,
and that $\hat{k}(\hat{\zeta},\ell) = \hat{k}(\hat{\eta}'_i,\ell)$. We denote them by
$\hat{G}$ and $\hat{k}$, respectively. Since $t = m + 2$ and $m \ge 1$, we have that
$t \ge 3$. Thus, $\hat{k} = \ell - (t + 1) \le \ell - 4 = k - 1$. Recall that $\hat{\zeta}$
and $\hat{\eta}'_i$ have the same input (and no failures) through time $k - 1$. It follows
that $\hat{V}(\hat{\zeta},\ell) = \hat{V}(\hat{\eta}'_i,\ell)$. It follows by Lemma 19 that
$(\hat{\zeta},\ell) \sim (\hat{\eta}'_i,\ell)$ in the omissions model, and hence that
$(\zeta,\ell) \sim (\eta'_i,\ell)$ in the generalized omissions model as well. Since
$(\rho,\ell) \sim (\zeta,\ell)$, it follows that for each $\varphi_i$ we have
$(\rho,\ell) \sim (\eta'_i,\ell)$ and $(\eta'_i,\ell) \not\models \varphi_i$. Therefore, for
each $\varphi_i$ we have $(\rho,\ell) \not\models C_{\mathcal{N}} \varphi_i$, and hence
$(\rho,\ell) \not\models \bigvee_i C_{\mathcal{N}} \varphi_i$. $\Box$
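
For reference, the time bookkeeping used in this half of the proof can be collected in one
place. Here $\ell = k + 3$ and $t = m + 2$ are as fixed in the construction, and $\hat{k}$
is written for the time index $\ell - (t+1)$ that the omissions-model construction assigns
to the two failure-free runs being compared; the display is a restatement of the argument,
not an additional claim.

```latex
\begin{align*}
t        &= m + 2 \ge 3                  && \text{since } m \ge 1, \\
\hat{k}  &= \ell - (t + 1) \le \ell - 4  && \text{since } t \ge 3, \\
\ell - 4 &= (k + 3) - 4 = k - 1          && \text{since } \ell = k + 3.
\end{align*}
```

Hence $\hat{k} \le k - 1$, which is why agreement of the two runs through time $k - 1$ is
enough to make their views at time $\hat{k}$ coincide.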


We have seen that, as a result of the uncertainty about the failure pattern, the com- 
plexity of determining whether admissible facts are common knowledge is dramatically 
greater in this model than in more benign models. It is conceivable, however, that this 
gap in complexity is due to the fact that faulty processors may fail both to send and 
to receive messages, and not merely due to the uncertainty about the failure pattern.
We can show, however, that it is precisely due to this uncertainty that we observe this 
complexity gap. Consider the closely related failure model we have termed generalized 
omissions with information, a model differing from the generalized omissions model in 
that a processor not receiving a message can determine whether it or the sender is at 
fault. We can show that the construction used in the omissions model may also be
used in this model to yield a set of views $\hat{V}(\rho,\ell)$ completely characterizing
what facts are common knowledge at the point $(\rho,\ell)$.


Proposition 33: In generalized omissions with information, we have
$(\rho,\ell) \models C_{\mathcal{N}} \varphi$ iff $(\rho',\ell) \models \varphi$ for all
$\rho'$ satisfying $\hat{V}(\rho',\ell) = \hat{V}(\rho,\ell)$.


All of the proofs in the omissions model hold when generalized to this model, with 
the exception that the construction must be started with a nonfaulty processor. (In 
particular, Lemma 16 holds only when the processors $p$ and $q$ are processors that do not
fail to receive messages.) This exception says that faulty processors may not be able to 
perform all actions performed by the nonfaulty processors, but this is no surprise since 
the same is true in the receiving omissions model. Furthermore, the computation of the 
sets $B_i$ in the construction now depends not only on the undelivered messages, but also
on the additional information that receiving processors obtain regarding blame for the 
undelivered messages. As in the omissions model, this construction yields a method of 
deriving efficient tests for common knowledge of certain facts. Thus, it is again possible 
to design efficient optimal protocols: 


Theorem 34: Let C be an implementable practical simultaneous choice. In general- 
ized omissions with information, there is a polynomial-time optimal protocol for C. 


This shows that it is precisely the uncertainty about the failure pattern that is respon- 
sible for the observed gap in complexity, and not merely the fact that faulty processors 
may fail both by failing to send and to receive messages. 


The uncertainty about the failure pattern in the generalized omissions model adds a 
new combinatorial structure to the similarity graph in this model that does not exist in 
other variants of the omissions model. Since it is possible to assign failure to processors 
in a number of different ways consistent with a pattern of undelivered messages, it is 
possible to play “pebbling games” with the failure pattern when constructing paths in
the similarity graph, showing that one point is similar to another point by alternately
assigning responsibility for undelivered messages to the sender and to the receiver. In
fact, in addition to increasing the difficulty of determining whether a fact is common
knowledge at a point, the following theorem shows that this new combinatorial
structure has interesting effects on when facts become common knowledge.
Theorem 35: In the generalized omissions model:


a. If $n \le 2t$ then the only facts that are common knowledge at time 2 are facts valid
at time 2.


b. If $n > 2t$ then some facts not valid at time 2 do become common knowledge at
time 2.


Proof: For part (a), it is enough to show that $(\rho,2) \sim (\rho',2)$ for all runs
$\rho$ and $\rho'$.


First we show that $(\rho,2) \sim (\eta,2)$ where $\eta$ is the failure-free run with the
input of $\rho$, and that $(\rho',2) \sim (\eta',2)$ for the analogous failure-free run
$\eta'$ with the input of $\rho'$. Suppose that $B$ and $G$ are the sets of faulty and
nonfaulty processors, respectively, in $\rho$. Without loss of generality, we may assume
that $|B| \le t$ and $|G| \le t$. Using Lemma 32, we see that $(\rho,2) \sim (\zeta,2)$
where $\zeta$ differs from $\rho$ in that processors in $B$ are silent during round 2 of
$\zeta$. The view of $G$ at $(\zeta,2)$ is independent of the view of $B$ at $(\zeta,1)$,
so $(\zeta,2) \sim (\zeta',2)$ where $\zeta'$ differs from $\zeta$ in that processors in
$B$ receive messages from all processors during round 1 of $\zeta'$. Again, using
Lemma 32, we see that $(\zeta',2) \sim (\zeta'',2)$ where $\zeta''$ differs from $\zeta'$
in that processors in $B$ do not fail in round 2 of $\zeta''$. The only failures remaining
in $\zeta''$ are failures of processors in $B$ to send to processors in $G$ during
round 1. It is therefore possible to reverse the roles of $B$ and $G$ in this argument to
show that $(\zeta'',2) \sim (\eta,2)$ where $\eta$ is the failure-free run with the input
of $\zeta''$ (and hence of $\rho$). By the transitivity of “$\sim$,”
$(\rho,2) \sim (\eta,2)$. An analogous argument shows that $(\rho',2) \sim (\eta',2)$.

Now we show that $(\eta,2) \sim (\eta',2)$ for all failure-free runs $\eta$ and $\eta'$.
Silence the processors in a set $B$ of $t$ processors during the run $\eta$ to yield a run
$\zeta$, and silence the processors in the set $B' = P - B$ of the remaining processors
during the run $\eta'$ to yield a run $\zeta'$. By the previous paragraph,
$(\eta,2) \sim (\zeta,2)$ and $(\eta',2) \sim (\zeta',2)$. Change the input received by
processors in $B$ during the run $\zeta$ to that received in $\zeta'$ to yield a run $\xi$,
and change the input received by processors in $B'$ during the run $\zeta'$ to that
received in $\zeta$ to yield a run $\xi'$. Clearly, $(\zeta,2) \sim (\xi,2)$ and
$(\zeta',2) \sim (\xi',2)$. In addition, by the previous paragraph,
$(\xi,2) \sim (\xi',2)$. Thus, $(\eta,2) \sim (\eta',2)$, as desired.
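
The first half of the argument for part (a) is a chain of similarities, which may be easier
to follow when laid out step by step. In the display below, $\zeta$, $\zeta'$, and
$\zeta''$ name the intermediate runs and $\eta$ the failure-free run with the input of
$\rho$; the annotations paraphrase the justifications given in the text.

```latex
\begin{align*}
(\rho,2) &\sim (\zeta,2)   && \text{silence $B$ during round 2 (Lemma 32)} \\
         &\sim (\zeta',2)  && \text{let $B$ receive every round-1 message} \\
         &\sim (\zeta'',2) && \text{undo the round-2 failures of $B$ (Lemma 32)} \\
         &\sim (\eta,2)    && \text{repeat with the roles of $B$ and $G$ reversed}
\end{align*}
```

The last step is available only because both $|B| \le t$ and $|G| \le t$, so that either
set may be taken to be the set of faulty processors.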


For part (b), let $B = \{p_1, \ldots, p_t\}$ and $G = \{p_{t+1}, \ldots, p_n\}$. Let
$\varphi$ be the fact “each processor in $B$ fails to send to every processor in $G$
during round 1.” We show that $\varphi \supset C_{\mathcal{N}} \varphi$ is valid at
time 2. Notice that $\neg\varphi \supset \neg C_{\mathcal{N}} \varphi$ and
$\neg C_{\mathcal{N}} \varphi \supset C_{\mathcal{N}}(\neg C_{\mathcal{N}} \varphi)$ are
valid. Since, as we shall soon show, $\varphi \supset C_{\mathcal{N}} \varphi$ is valid at
time 2, so are $\neg C_{\mathcal{N}} \varphi \supset \neg\varphi$ and
$C_{\mathcal{N}}(\neg C_{\mathcal{N}} \varphi) \supset C_{\mathcal{N}}(\neg\varphi)$. It
follows that $\neg\varphi \supset C_{\mathcal{N}}(\neg\varphi)$ is also valid at time 2.
Thus, either $\varphi$ or $\neg\varphi$ is common knowledge at time 2, yet neither is valid
at time 2. Now, let $\rho$ be a run satisfying $\varphi$. Since $n > 2t$, there are at
least $t + 1$ processors in $G$, and none of them receives a message from any of the $t$
processors in $B$ during round 1, so $G$ implicitly knows at time 1 that $B$ contains $t$
faulty processors. By Lemma 31, the joint view of the processors in $G$ at time 1 is
common knowledge at time 2. Since $\varphi$ follows from the joint view of $G$ at time 1,
$\varphi$ is common knowledge at time 2. Thus, $\varphi \supset C_{\mathcal{N}} \varphi$
is valid at time 2. $\Box$
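
The counting step behind the dichotomy in Theorem 35 can be explored with a small
brute-force search. Assuming processors are numbered 0, ..., n - 1 with B the first t of
them (an encoding chosen for this sketch only, as is the function name), the code below
lists every set of at most t processors that could be blamed when no processor in G
receives a round-1 message from any processor in B.

```python
from itertools import combinations


def fault_sets_explaining_phi(n, t):
    """Fault sets of size <= t consistent with B = {0..t-1} never reaching G = {t..n-1}."""
    lost = [(b, g) for b in range(t) for g in range(t, n)]
    explanations = []
    for size in range(t + 1):
        for candidate in combinations(range(n), size):
            blamed = set(candidate)
            # Each lost message needs a faulty sender or a faulty receiver.
            if all(b in blamed or g in blamed for b, g in lost):
                explanations.append(blamed)
    return explanations


print(fault_sets_explaining_phi(5, 2))  # n > 2t:  [{0, 1}]
print(fault_sets_explaining_phi(4, 2))  # n <= 2t: [{0, 1}, {2, 3}]
```

With n > 2t the only explanation blames all of B, matching the claim that G implicitly
knows B contains t faulty processors. With n <= 2t the receivers could equally be at
fault. The sketch illustrates only this blame-assignment step, not the full similarity
argument of part (a).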


As a result of Theorem 35, when $n \le 2t$ no nontrivial simultaneous choice can
be performed at time 2 in this model. We remark that this is the first evidence of
behavior in such a benevolent failure model depending on the ratio of $n$ and $t$. In
addition, Theorem 35 tells us that nonvalid facts do not become common knowledge
at time 2 if $n \le 2t$, yet we know that such facts do become common knowledge in
more benign models. As a consequence, protocols that are optimal in the generalized 
omissions model will not be optimal in the omissions model, or even in the generalized 
omissions model with information. 


The generalized omissions model therefore seems to be a natural failure model that 
already displays some of the complex behavior of the more malicious models, while still 
involving only benevolent processors that faithfully follow their protocols. We believe 
that this model is therefore a natural candidate for further study as an intermediate 
model on the way to understanding the mysteries of fault tolerance in truly malicious 
failure models. 


7 Conclusions 


This paper applies the theory of knowledge in distributed systems to the design and 
analysis of fault-tolerant protocols for a large and interesting class of problems. This is 
a good example of the power of applying reasoning about knowledge to obtain general, 
unifying results and a high-level perspective on issues in the study of unreliable systems. 
We believe that reasoning about knowledge will continue to be an effective tool in 
studying the basic structure and the fundamental phenomena in a large variety of 
problems in distributed computing. 


Given the effectiveness of a knowledge-based analysis in the case of simultaneous 
actions (see also [DM]), it would be interesting to know whether such an analysis can
shed similar light on the case of eventually coordinated actions. Dolev, Reischuk, and 
Strong show that the problem of performing eventually coordinated actions in such 
synchronous systems is quite different from that of performing simultaneous actions 
(cf. [DRS]). In addition to common knowledge, an analysis of eventually coordinated 
actions may be able to make good use of the notion of eventual common knowledge 
(cf. [HM], [Mo]). We note that it is possible to show that for eventual choice problems 
there do not, in general, exist protocols that are optimal in all runs. For example, 
one can give two protocols for (eventual) Byzantine agreement with the property that 
for every operating environment one of these protocols will reach Byzantine agreement 
(i.e., all processors will decide on a value) by time 2 at the latest. However, if $t > 1$,
it is well known that no single protocol can guarantee that agreement will be reached
by time 2 in all runs. What is the best notion of optimality that can be achieved in 
eventual coordination? 


We provide a method of deriving an optimal protocol for any given implementable 
specification of a simultaneous choice problem. However, in this work, we have com- 
pletely sidestepped the interesting question of characterizing the problems that are and 
are not implementable in different failure models. We believe that a general analysis 
of the implementability of problems involving coordinated actions in different failure 
models will expose many of the important operational differences between the mod- 
els. As an example, our specification of the distributed firing squad problem in the 
introduction is implementable in the variants of the omissions model, but is not
implementable in more malevolent models, in which a faulty processor can falsely claim to
have received a “start” message and otherwise seem to behave correctly (see [BL] and 
[CDDS] for definitions of versions of the firing squad problem that are implementable 
in the more malicious models). 


In the generalized omissions model, we have shown how to derive optimal pro- 
tocols for nontrivial simultaneous choice problems, requiring processors to perform 
polynomial-space computations between consecutive rounds. We have also shown an 
NP-hard lower bound for any communication-efficient protocol for such a problem that 
is optimal in all runs. Determining the precise complexity of this task is a nontrivial 
open problem, due to the interesting combinatorial structure underlying the generalized 
omissions model. It would also be interesting to extend our study to more malicious 
failure models, such as the Byzantine and the authenticated Byzantine models (cf. [F]). 
It is not immediately clear whether the notion of a failure pattern can be defined in 
these models in a protocol-independent fashion. Thus, it is not clear that the notion 
of optimality in all runs is well defined in such models. If such definitions are possible, 
we believe that the NP-hardness result from the generalized omissions model should 
extend to these models. (Our proof does show that testing for common knowledge in 
runs of the full-information protocol $\mathcal{F}$ in both models is NP-hard.) Capturing
the precise combinatorial structure of the similarity graph in these models is bound to expose
many of the mysterious properties of the models. We believe that this is an important 
first step in understanding these models. 


As we have seen, there are no computationally efficient optimal protocols for simultaneous
choice problems in the generalized omissions model unless P = NP. Since it is unreasonable
to expect processors to perform NP-hard computations between consecutive rounds of
communication, it is natural to ask what is the earliest time at which such actions can
be performed by resource-bounded processors (e.g., processors that can perform only
polynomial-time computations). Are there always guaranteed to be optimal protocols 
for such processors? How can they be derived? The analysis of this question is no 
longer as closely related to the question of when facts about the run become common 
knowledge. It seems that the information-based definition of knowledge that we
presented in Section 3, used in many other papers as the definition of knowledge in a
distributed system (cf. [CM], [DM], [FI], [HM], and [PR]), is not appropriate for rea- 
soning about such questions. A major challenge motivated by this is the elaboration 
of the definition of knowledge presented in Section 3 to include notions of resource- 
bounded knowledge that would provide us with appropriate tools for analyzing such 
questions. Such a theory would provide notions such as polynomial-time knowledge and 
polynomial-time common knowledge, which would correspond to the actions and the 
simultaneous actions that polynomial-time processors can perform. Note that the fact 
that (suboptimal) polynomial-time protocols for the simultaneous Byzantine agreement 
problem exist even in the more malicious failure models implies that, given the right
notions, many relevant facts should become polynomial-time common knowledge. Much
work is left to be done. 